I've been thinking about assessing the cost of a coevolutionary algorithm. I think that the lack of an explicit objective function makes cost assessment tricky in different way from the ordinary issues surrounding assessing cost in EC. I was hoping we could get some list of references started and some discussion on this point. And since there aren't any other posts yet...
Here are three questions I had in mind; I elaborate on them below.
1. What is an appropriate tradeoff between per-generation evaluational costs and the aim of finding good solutions?
2. Which makes more sense: measuring the performance gains achieved after N fitness evaluations? Or, asking how many fitnesses evaluations are needed to reach performance level X?
3. As far as counting and comparing the number of fitness evaluations, do notions like computational complexity really make sense when we're solving an instance of a problem vs. a class of problems?
Question 1. I wanted to mention a paper I wrote for GECCO 2005, On Identifying Global Optima in Cooperative Coevolution (you can get that off my home page http://www.cs.brandeis.edu/~abucci/). I was struck by the fact that there were cooperative coevolutionary algorithms (CCEAs) which, when run on certain functions (called MTQ functions) will almost never find the global optima. I saw that observation in a paper by Liviu Panait and Sean Luke, and rigged up an example, based on one of their examples, to amplify the effect.
Naturally that's a straw man, but it highlights my first question. The straw man CCEA I used was considered fast in the sense that it used very few fitness evaluations per generation. But, the algorithm almost never found global optima, so in what sense is that fast? By that measure, an algorithm which does nothing but spit out a random individual is even faster because it uses no fitness evaluations at all.
More generally, when there's the possibility of an algorithm 'overspecializing,' we face a similar question. If an algorithm rapidly overspecializes on a brittle solution, is it really fast? It may use only a few fitness evaluations, but so what?
So I see addressing this question as one of balancing the goals of the algorithm against how many fitness evaluations are needed to get there. How do we do that? Which leads me to...
Question 2. Do we give an algorithm a budget of N fitness evaluations and ask what level of performance it can achieve with that budget? This seems OK on the surface, but when it comes to comparing two algorithms, things get trickier. What if algorithm A is better than B for the first N evaluations used, but then they cross and B is better than A from N+1 onward? For instance, B may be assembling knowledge (like an archive) which makes it achieve small fitness gains early in order to achieve faster fitness gains later. There's a kind of horizon effect here, to put it still differently.
Do we assert a high-water mark level of performance, X, and ask how long it takes each algorithm to get there? First, notice that this will give a different ranking of algorithms than the previous one. But it may too be infeasible. Perhaps some (or all) algorithms cannot reach X. Particularly when we do not know much about the problem and are using coevolution to explore rather than to solve, we may not know where to set X. So this type of comparison has its drawbacks also.
Is there a hybrid? Is the appropriate type of comparison relative to the task?
Worse still, exactly how are we assessing performance, anyway? If we have an offline metric, good. But if not, if the algorithm itself is coevolving its fitness function, then any performance gain we see is relative to the state of the algorithm. Which means that algorithm A may look like it made good gains when it did so against a wimpy fitness function. Algorithm B may look like it made little progress, but since it had simultaneously developed a strong fitness function, it actually made significant gains there. I argue that Fig 5 in my 2003 GECCO paper Focusing versus Transitivity. How do we compare two algorithms in light of this possibility?
Question 3. Let's say you want to coevolve tic-tac-toe players, and you find it takes 10 million games of tic-tac-toe to find one which never loses to your test players. Was that fast or slow? Computational complexity concepts like big-O notation doesn't help here. It's not that there is a class of tic-tac-toe games which are scaling, so that we can say our algorithm requires O(n^2) games to find a good player. Arguably, we can scale some games like Go, but even then it's not clear that we're solving "the same" problem when we do that (if you've played both 9x9 and 19x19 Go, you know these are different games for human beings at least).
You can scale population size, and you often see claims that an algorithm is fast because it uses "only" O(n) fitness evaluations per generation, say. This all seems moot in light of the questions I raised above, however.
Hi Anthony - thanks for being first cab off the rank with an interesting issue! These types of questions are important as we try to become more scientific about EC research.
I agree with your answer "The appropriate type of comparison is relative to the task". e.g. If an algorithm is to be deployed in some industrial process control application, where it will be run periodically, will have a set budget of time, and must achieve a certain level of performance, then an approriate measure might be, say the probability of failing to meet the required performance level. Or if a 100% reliable but less optimal fallback method is available, it might be, say, the mean amount by which the algorithm beats the fallback method. Again, if the task is, say a once-off engineering design task, then the appropriate measure might be the expected fitness that can be achieved within a given budget of evaluations, or maybe the expected maximum fitness over a number of shorter runs in a given time.
If the algorithm is being considered without a specific task in mind, this poses the question: which performance measure(s) should be used? Maybe a reasonable answer here would be to provide more than one measure (perhaps a standard list to suit different tasks), so that users of the algorithm can choose the measure most of interest to them.
Another interesting point you make is that in the case of coevolution, fitness is only defined relative to the current population, so generally, measures like those above, based on comparing fitness values calculated relative to different populations are not meaningful.
One answer might be to calculate some kind of "external fitness", by testing evolved solutions in a number of "typical" environments (e.g. testing a co-evolved Othello player against a couple of standard players of known strength). But this has some problems:
* who is to say what a typical environment is? * if a typical environment can be agreed, why bother with co- evolution? Why not just use it to measure fitness for a standard EA? * sometimes one may not be interested in any kind of "external fitness" anyway, but may be intersted, for example, in the dynamic behaviour of the system in different circumstances.
Maybe these questions have been asked and answered before - I'm sure someone out there will be able to point us to relevant literature on co-evolution??
On Apr 18, 11:53 pm, abucci <abu...@gmail.com> wrote:
> I've been thinking about assessing the cost of a coevolutionary > algorithm. I think that the lack of an explicit objective function > makes cost assessment tricky in different way from the ordinary issues > surrounding assessing cost in EC. I was hoping we could get some list > of references started and some discussion on > this point. And since there aren't any other posts yet...
> Here are three questions I had in mind; I elaborate on them below.
> 1. What is an appropriate tradeoff between per-generation evaluational > costs and the aim of finding good solutions?
> 2. Which makes more sense: measuring the performance gains achieved > after N fitness evaluations? Or, asking how many fitnesses > evaluations are needed to reach performance level X?
> 3. As far as counting and comparing the number of fitness evaluations, > do notions like computational complexity really make sense when we're > solving an instance of a problem vs. a class of problems?
> Question 1. > I wanted to mention a paper I wrote for GECCO 2005, On Identifying > Global Optima in Cooperative Coevolution (you can get that off my home > pagehttp://www.cs.brandeis.edu/~abucci/). I was struck by the fact > that there were cooperative coevolutionary algorithms (CCEAs) which, > when run on certain functions (called MTQ functions) will almost never > find the global optima. I saw that observation in a paper by Liviu > Panait and Sean Luke, and rigged up an example, based on one of their > examples, to amplify the effect.
> Naturally that's a straw man, but it highlights my first question. > The straw man CCEA I used was considered fast in the sense that it > used very few fitness evaluations per generation. But, the algorithm > almost never found global optima, so in what sense is that fast? By > that measure, an algorithm which does nothing but spit out a random > individual is even faster because it uses no fitness evaluations at > all.
> More generally, when there's the possibility of an algorithm > 'overspecializing,' we face a similar question. If an algorithm > rapidly overspecializes on a brittle solution, is it really fast? It > may use only a few fitness evaluations, but so what?
> So I see addressing this question as one of balancing the goals of the > algorithm against how many fitness evaluations are needed to get > there. How do we do that? Which leads me to...
> Question 2. > Do we give an algorithm a budget of N fitness evaluations and ask what > level of performance it can achieve with that budget? This seems OK > on the surface, but when it comes to comparing two algorithms, things > get trickier. What if algorithm A is better than B for the first N > evaluations used, but then they cross and B is better than A from N+1 > onward? For instance, B may be assembling knowledge (like an archive) > which makes it achieve small fitness gains early in order to achieve > faster fitness gains later. There's a kind of horizon effect here, to > put it still differently.
> Do we assert a high-water mark level of performance, X, and ask how > long it takes each algorithm to get there? First, notice that this > will give a different ranking of algorithms than the previous one. > But it may too be infeasible. Perhaps some (or all) algorithms cannot > reach X. Particularly when we do not know much about the problem and > are using coevolution to explore rather than to solve, we may not know > where to set X. So this type of comparison has its drawbacks also.
> Is there a hybrid? Is the appropriate type of comparison relative to > the task?
> Worse still, exactly how are we assessing performance, anyway? If we > have an offline metric, good. But if not, if the algorithm itself is > coevolving its fitness function, then any performance gain we see is > relative to the state of the algorithm. Which means that algorithm A > may look like it made good gains when it did so against a wimpy > fitness function. Algorithm B may look like it made little progress, > but since it had simultaneously developed a strong fitness function, > it actually made significant gains there. I argue that Fig 5 in my > 2003 GECCO paper Focusing versus Transitivity. How do we compare two > algorithms in light of this possibility?
> Question 3. > Let's say you want to coevolve tic-tac-toe players, and you find it > takes 10 million games of tic-tac-toe to find one which never loses to > your test players. Was that fast or slow? Computational complexity > concepts like big-O notation doesn't help here. It's not that there > is a class of tic-tac-toe games which are scaling, so that we can say > our algorithm requires O(n^2) games to find a good player. Arguably, > we can scale some games like Go, but even then it's not clear that > we're solving "the same" problem when we do that (if you've played > both 9x9 and 19x19 Go, you know these are different games for human > beings at least).
> You can scale population size, and you often see claims that an > algorithm is fast because it uses "only" O(n) fitness evaluations per > generation, say. This all seems moot in light of the questions I > raised above, however.
Maybe an interesting perspective on this issue is to look at some of the more subjective intuitions that motivate coevolution aside from performance metrics. For example, at least some of the intuitive excitement with coevolution stems from the possibility of simply producing something that is impossible to produce in any other way. Of course you can question how we would know that a particular result was otherwise impossible, or how we can know what exactly what was produced without some kind of measurement, but again, I think intuitively the idea is just that it would just be obvious.
For example, if my computer could play against itself for a few days until it became a world champion Go player, ambiguities in measurement would not much matter. In fact, it is hard to imagine how else my computer could become that way on its own. There is no external objective policy that can be used as an evaluation function to produce equivalent results. Perhaps the method need not be coevolution (e.g. it could be value-function RL self-play), but the potential of self- play is the intuitive core of why performance measurement/comparison is at some level, at least subjectively, only a side-issue.
I am not by any means suggesting we ignore performance metrics as an important practical issue, and we should think about them. At the same time, it's interesting that discussions of performance in coevolution can sometimes obfuscate the real reason coevolution is so appealing. After all, does it really matter if one method is proven to be objectively "faster" than another if the other method plays up to world champion status in whatever amount of time? Furthermore, barring a result that reaches record-shattering performance, does it really inform the direction of future research if a comparison between two coevolution approaches is objectively definitive one way or another? That is, if algorithm A is objectively faster, or reaches a goal point sooner than algorithm B, does that mean anything important about A vs. B if our ultimate goal is to evolve a world champion?
Or even more extreme, what about domains with no global optima, i.e. open-ended domains? In such domains the main issue is not cost, but rather whether either algorithm can keep improving indefinitely.
Like I said, I'd like to pin down answers to Anthony's questions too, but they lead to even deeper questions.
Anthony and I have been discussing some issues in personal correspondence, and some of them have touched on these issues; however, I wanted to chime in on the public forum because I think my views may be quite different from Anthony's.
I'll endeavor to be brief (which is not in my nature).
> 1. What is an appropriate tradeoff between per-generation evaluational > costs and the aim of finding good solutions?
> ... > More generally, when there's the possibility of an algorithm > 'overspecializing,' we face a similar question. If an algorithm > rapidly overspecializes on a brittle solution, is it really fast? It > may use only a few fitness evaluations, but so what?
> So I see addressing this question as one of balancing the goals of the > algorithm against how many fitness evaluations are needed to get > there. How do we do that? Which leads me to...
The question is certainly one of goals. An algorithm that performs well on a handful of problem instances, but fails utterly to solve (perhaps most) other problems may be exactly what is wanted. It is hard to say up front ... I don't think there is a "right answer" here.
I think the key is to be clear what the goals are up front. Perhaps that's where we fall down. In traditional algorithm analysis, it is normal to talk about completeness and efficiency and let the users of the algorithms decide for themselves. I think that's an apt analogy here.
It doesn't seem strange to me that as a community we might offer practitioners the following message: This algorithm monotonically approaches the following solution concept but may (for many problems) require a fair number of computational resources (memory and/or CPU), while this other algorithm gives you no such promises but uses far fewer resources and happens to do well on a variety of problems with the following properties.
> 2. Which makes more sense: measuring the performance gains achieved > after N fitness evaluations? Or, asking how many fitnesses > evaluations are needed to reach performance level X? > Question 2. > Do we give an algorithm a budget of N fitness evaluations and ask what > level of performance it can achieve with that budget? This seems OK > on the surface, but when it comes to comparing two algorithms, things > get trickier. ...
> Do we assert a high-water mark level of performance, X, and ask how > long it takes each algorithm to get there? First, notice that this > will give a different ranking of algorithms than the previous one. > But it may too be infeasible. ...
Again, I don't think there is a "right answer" here. This is an age- old question in the optimization literature, no?
It seems both views have values and both views have drawbacks. To me they underscore very different kinds of goals on the part of the practitioner: 1.) Am I interested in rapidly finding a solution? Or, 2.) Am I interested in finding the highest quality solution I can get in some period of time. It depends on the computation constraints of the user, the type of problem they are solving, etc.
It cannot be answered generally.
> Is there a hybrid? Is the appropriate type of comparison relative to > the task?
Perhaps a kind of Pareto front in algorithm space? 8^)
> 3. As far as counting and comparing the number of fitness evaluations, > do notions like computational complexity really make sense when we're > solving an instance of a problem vs. a class of problems? > Question 3. > ...Computational complexity > concepts like big-O notation doesn't help here. It's not that there > is a class of tic-tac-toe games which are scaling, so that we can say > our algorithm requires O(n^2) games to find a good player....
Gosh, I don't agree at all here. I think Big-O could absolutely be useful ... but I wouldn't talk in terms of scaling the game size or the population size, rather the potential strategy set (that is, after all, what is being optimized).
Once one takes the time to write down a solution concept and an algorithm (with representation), one can ask the question: How many evaluations do we expect before we produce a solution that meets our SC definition?
Consider Richard's 2001 GECCO paper: He mimics a binary representation, assuming unitation is used for each dimension, then applies the numbers game. If we pick a solution concept and apply a very simple CEA (perhaps something like Pablo's CEA in his 2005 GECCO paper), we can ask the question: How does the size of that bit string affect how many evaluations it takes to find a solution according to the SC definition.
Not only do I think it can be done, but I believe it would be very useful to know how the size of the strategy set affects a CEA's ability to find such solutions. And I think it has everything to do with (one aspect) of CEA performance.