KataGo V1.3

Limeztone · **#121**

lightvector wrote:

Stating a fixed number of playouts per move is NOT enough to define a constant, hardware-independent strength. You also have to specify how many threads were used to generate those playouts.

I don't get this...
Are you actually saying that the same net with the same maxPlayouts could be different in strength depending on the number of threads (or executed on different hardware)?

jann · **#122**

Limeztone wrote:

As I understand visits vs. playouts, if you clear the tree for every move made, visits and playouts become the same.

Thus a big change for a playout based test (which was affected by a random search bonus without this).
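
A minimal sketch of the bookkeeping behind this, assuming the usual convention that playouts count only the new search iterations for the current move, while visits also count what was inherited from the reused subtree. The function names are hypothetical and this is not KataGo's code.

```python
def visits_after_search(playouts: int, reused_visits: int = 0) -> int:
    """Root visits after searching: inherited visits plus new playouts."""
    if playouts < 0 or reused_visits < 0:
        raise ValueError("counts must be non-negative")
    return reused_visits + playouts


def playouts_allowed(limit: int, reused_visits: int, limit_kind: str) -> int:
    """New playouts permitted this move under a 'playouts' or a 'visits' cap."""
    if limit_kind == "playouts":
        # A playout cap ignores the reused subtree: reuse is a free bonus.
        return limit
    if limit_kind == "visits":
        # A visit cap charges the reused subtree against the budget.
        return max(0, limit - reused_visits)
    raise ValueError("limit_kind must be 'playouts' or 'visits'")
```

With the tree cleared (`reused_visits=0`) both caps allow the same amount of new search. With reuse, the playout cap ends at a larger tree, and how much larger depends on how much of the previous search happened to land in the subtree that was kept.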

Quote:

the same net with the same maxPlayouts could be different in strength depending on the number of threads (or executed on different hardware)

More search threads means weaker search (less freedom in which nodes to visit/expand).

inbae · **#123**

jann wrote:

For example, if you clear the tree each move, fixed playout tests are heavily affected (the same amount of playouts / work will do less effective search) while fixed visit tests are less so (single threaded at least).

Yes, clearing the search tree will certainly affect the results, but I don't think that is realistic in match conditions.

jann wrote:

Another example is when you find an otherwise weaker side ahead because of a higher degree of tree reuse (thus effectively more, but weaker, search). Then you repeat the test in a different visit/playout range and find that these two factors now compensate for each other less, and the other side comes out ahead.

In this very example, I would say that the stronger engine is weakened by not reusing the tree effectively. At the end of the day, this boils down to the question of which represents strength better: fixed playouts or fixed visits. For the reasons given above, I think a fixed-playouts test reflects real-world strength more accurately, since policy sharpness is a direct result of NN inference.

jann · **#124**

A wider test that allows more factors to affect the result will certainly be closer to being called a "real world" test (which is usually a situation with many affecting factors - hence with very hard to interpret results!).

But I think you missed the point of the last example, where your playout results may even trick you. It is known that more search affects different nets/engines differently (stronger ones tend to benefit more). A weaker net with sharper policy that allows more tree reuse trades search quality for (effective) quantity. At 1000 playouts it may win your test, and you may think it is "stronger in real world", but at 10000 playouts (where search quality starts to matter more) it may lose.

Thus your results will be less robust or consistent/representative across various real world scenarios (much as if you allowed hardware factors to affect your test). With a visit based test, whichever side wins at 1000 visits will likely also win at 10000 visits.

inbae · **#125**

jann wrote:

A weaker net with sharper policy that allows more tree reuse trades search quality for (effective) quantity. At 1000 playouts it may win your test, and you may think it is "stronger in real world", but at 10000 playouts (where search quality starts to matter more) it may lose.

Thus your results will be less robust or consistent/representative across various real world scenarios (much as if you allowed hardware factors to affect your test). With a visit based test, whichever side wins at 1000 visits will likely also win at 10000 visits.

A lower-visit test does not necessarily correlate with higher-visit tests: networks scale differently anyway, and different visit/playout counts require additional tests. This becomes clearer when we consider that the value head has more influence in deeper searches. For example, a relative scaling test by Friday9i shows networks scaling differently at fixed visits.

jann · **#126**

The margin of victory will be different, but not the winner (assuming identical visits - unlike with identical playouts).

Those linked tests used non-identical visits, thus widened the test up to a new factor (scalability). But this was on purpose there, since the test was not about raw strength but scalability itself.

inbae · **#127**

jann wrote:

The margin of victory will be different, but not the winner (assuming identical visits - unlike with identical playouts).

Those linked tests used non-identical visits, thus widened the test up to a new factor (scalability). But this was on purpose there, since the test was not about raw strength but scalability itself.

The point is the difference in scalability. If two networks scale differently, you cannot guarantee that the winner at lower visits will also win at higher visits. I have no idea why you are confident that the winner will not change here. Moreover, one sometimes wants to measure the margin of victory (or Elo rating difference) as well.
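
For the Elo rating difference mentioned here, a small sketch of the standard logistic Elo conversion from a match score to a rating gap. The function name is hypothetical, and counting draws as half a win is an assumption of this sketch.

```python
import math


def elo_difference(win_rate: float) -> float:
    """Elo gap implied by a score fraction under the logistic Elo model."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


def score_fraction(wins: int, losses: int, draws: int = 0) -> float:
    """Match score with draws counted as half a win (an assumption here)."""
    games = wins + losses + draws
    if games == 0:
        raise ValueError("no games played")
    return (wins + 0.5 * draws) / games
```

A 75% score maps to roughly 191 Elo, and a 50% score to 0. Reporting the gap at each visit or playout count tested, rather than only the winner, makes differences in scaling visible.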

jann · **#128**

The difference in scalability means that net A needs 1.5x more visits around 1000 visits (to compensate for being weaker) but 2.5x more around 10000 visits. In both cases it is weaker than B, and would (obviously) lose at 1.0x visits.

Network strengths will not (or rarely) swap; what usually happens is that another factor (like more search) may compensate for the raw strength difference.
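
One way to make the "1.5x here, 2.5x there" idea concrete is sketched below. It assumes, purely for illustration, that each net's Elo grows linearly in the logarithm of its visits. The function name and the (intercept, slope) numbers are hypothetical, not measurements.

```python
import math


def visit_multiplier_to_match(visits: float, weak: tuple, strong: tuple) -> float:
    """Factor k so that the weaker net at k * visits matches the stronger net at visits.

    weak and strong are (intercept, slope) pairs of an assumed model
    elo(v) = intercept + slope * ln(v), with a positive slope.
    """
    weak_intercept, weak_slope = weak
    strong_intercept, strong_slope = strong
    log_v = math.log(visits)
    gap = (strong_intercept + strong_slope * log_v) - (weak_intercept + weak_slope * log_v)
    # Multiplying visits by k adds weak_slope * ln(k) Elo to the weaker net.
    return math.exp(gap / weak_slope)


# Hypothetical nets: B starts ahead and also scales slightly better than A.
net_a = (0.0, 100.0)
net_b = (20.0, 103.0)
k_low = visit_multiplier_to_match(1_000, net_a, net_b)
k_high = visit_multiplier_to_match(10_000, net_a, net_b)
```

With these made-up numbers both multipliers are above 1 and the second is larger, so A is weaker at equal visits in both ranges while needing a growing amount of extra search to compensate.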

inbae · **#129**

jann wrote:

The difference in scalability means that net A needs 1.5x more visits around 1000 visits (to compensate for being weaker) but 2.5x more around 10000 visits. In both cases it is weaker than B, and would (obviously) lose at 1.0x visits.

Network strengths will not (or rarely) swap; what usually happens is that another factor (like more search) may compensate for the raw strength difference.

The true implication of the scaling test is that the ratings of networks increase with different slopes with respect to the logarithm of playouts (if we assume a naive approximation that Elo rating increases linearly with log(playouts)). It suggests that a better scaling network can eventually overcome another network of worse scaling given enough playouts. This will especially be the case for a network with a better value head: given more playouts, the search will be influenced more by the value head. Such networks can result from different weights in the loss function during training.
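
Under that naive log-linear approximation, the point where a better-scaling network overtakes another can be written down directly. The sketch below uses hypothetical names and made-up (intercept, slope) values; it illustrates the model only and is not fitted to any real test.

```python
import math


def elo_at(intercept: float, slope: float, playouts: float) -> float:
    """Naive model: Elo rises linearly with ln(playouts)."""
    return intercept + slope * math.log(playouts)


def crossover_playouts(intercept_a, slope_a, intercept_b, slope_b):
    """Playout count where the two lines meet, or None if they are parallel.

    Solves intercept_a + slope_a * ln(n) = intercept_b + slope_b * ln(n).
    A result below 1 means the lines only cross outside the usable range.
    """
    if slope_a == slope_b:
        return None
    return math.exp((intercept_b - intercept_a) / (slope_a - slope_b))


# Hypothetical nets: A starts 200 Elo behind B but gains more per ln(playouts).
n_star = crossover_playouts(0.0, 100.0, 200.0, 75.0)
```

With these values the lines meet at e^8, a little under 3000 playouts: B is ahead at 1000 playouts and A is ahead at 10000. Whether real networks ever reach such a crossing is exactly what the thread disputes.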

jann · **#130**

inbae wrote:

It suggests that a better scaling network can eventually overcome another network of worse scaling given enough playouts.

Sure, that's why more search usually helps the stronger but slower side. But note that even in your linked graph, the curves usually don't cross the line of "1" (which would happen if the identical-visit winner could easily swap). Being stronger somewhere at identical visits normally determines the rest of the curve; the only question is the slope (i.e., when the extra search required to compensate becomes more than what's available from, e.g., the speed difference).

inbae · **#131**

jann wrote:

Being stronger somewhere at identical visits normally determines the rest of the curve

Not necessarily, I suppose. For example, there are U-shaped curves, where a weaker network benefits from more visits at sweet spots but falls off at higher visits due to scaling. Another group consists of the purple curves, where ELFv1 scales better than LZ18x, though the ratio=1 line was not crossed there. We are clearly seeing networks scale differently. Judging the strength of a network from a single visit count is as dangerous as testing with 1 playout only; strength should be tested in terms of [network, playouts (or visits), number of threads], for example. There is nothing like an absolute measure of the "strength of a network".

And still, this discussion does not justify arguments such as

jann wrote:

A weaker net with sharper policy that allows more tree reuse trades search quality for (effective) quantity. At 1000 playouts it may win your test, and you may think it is "stronger in real world", but at 10000 playouts (where search quality starts to matter more) it may lose.

Thus your results will be less robust or consistent/representative across various real world scenarios (much as if you allowed hardware factors to affect your test). With a visit based test, whichever side wins at 1000 visits will likely also win at 10000 visits.

jann · **#132**

Everything is possible, but not everything is (equally) probable. In any case, a visit based test is more likely to be consistent across visit ranges.

inbae wrote:

And still, this discussion does not justify arguments such as

With a playout based test, the side that tends to support more tree reuse has, say, a constant 1.5x effective search advantage. This is similar to a smaller, weaker but faster net, which (with proportional visits, or time based test) can win low search matches but lose high search matches.

inbae · **#133**

jann wrote:

With a playout based test, the side that tends to support more tree reuse has, say, a constant 1.5x effective search advantage. This is similar to a smaller, weaker but faster net, which (with proportional visits, or time based test) can win low search matches but lose high search matches.

Such a search advantage can be converted into visits. Say it is 1.5x: then it will be something like 1500 visits for network A vs 1000 visits for network B. Do you imply that 1000 vs 1000 visits will be consistent with 10000 vs 10000 visits, but 1500 vs 1000 won't be so with 15000 vs 10000, for example?

And if a weaker-at-lower-playouts net can manage to win at higher playouts, let it be. It is ultimately what really matters, and the strength of a network cannot be thought of without considering playouts (or visits).

jann · **#134**

inbae wrote:

Do you imply that 1000 vs 1000 visits will be consistent with 10000 vs 10000 visits, but 1500 vs 1000 won't be so with 15000 vs 10000, for example?

Yes, this is also the exact meaning of the scalability graph you linked. (With some effort you may even find examples for these particular numbers there - crossing the 1.5 line but not the 1.0 line.)

inbae · **#135**

jann wrote:

inbae wrote:

Do you imply that 1000 vs 1000 visits will be consistent with 10000 vs 10000 visits, but 1500 vs 1000 won't be so with 15000 vs 10000, for example?

Yes, this is also the exact meaning of the scalability graph you linked.

You are right about the latter, but the former is less justified. As both policy and value heads are involved in the search, networks should behave differently with different playouts/visits. You can think of very extreme cases such as an abysmal value head, one-hot policy or 1 visit case, etc.

And what I'm constantly insisting is that there is nothing wrong with benefiting from tree reuse, or the results varying with playouts. You are tacitly suggesting that the result should be consistent regardless of computational cost, but I have no idea why.

jann · **#136**

I never said it SHOULD. There is nothing wrong with wider tests that are affected by more factors - time based tests even. But such results are inherently harder to interpret and less portable, and IF/WHEN you are unable to test on the exact conditions of later use, narrower tests that are affected by fewer factors and thus more consistent can be more informative with less danger of being misleading. Advantages and disadvantages.

inbae · **#137**

jann wrote:

But such results are inherently harder to interpret and less portable, and IF/WHEN you are unable to test on the exact conditions of later use, narrower tests that are affected by fewer factors and thus more consistent can be more informative with less danger of being misleading. Advantages and disadvantages.

I think you are trying to oversimplify things. As I have said above, all of [network, playouts/visits, number of threads] should be considered at least, and given those, tests should be reproducible to an extent. Ultimately strength matters, and tests without such details or contexts are less likely to be informative for users.
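
A possible way to keep those details attached to every reported result is sketched below. The class and field names are hypothetical and do not correspond to any engine's actual configuration keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrengthTestConfig:
    """Conditions under which a strength result was measured."""
    network: str
    limit_kind: str  # "playouts" or "visits"
    limit: int
    threads: int

    def __post_init__(self):
        if self.limit_kind not in ("playouts", "visits"):
            raise ValueError("limit_kind must be 'playouts' or 'visits'")
        if self.limit < 1 or self.threads < 1:
            raise ValueError("limit and threads must be positive")

    def label(self) -> str:
        """Short description to print next to a match result."""
        return f"{self.network} @ {self.limit} {self.limit_kind}, {self.threads} thread(s)"


# Hypothetical example: same net and playout cap, different thread counts.
single = StrengthTestConfig("netA", "playouts", 1000, 1)
multi = StrengthTestConfig("netA", "playouts", 1000, 8)
```

Because the two configurations compare as unequal, results measured with 1 thread and with 8 threads stay separate records instead of being pooled as "netA at 1000 playouts".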

jann · **#138**

If you can test with the exact target conditions, by all means do so. If you cannot, and need to extrapolate from tests performed under different conditions, you are better off with more consistent tests that each measure only one specific factor.

inbae · **#139**

I consider fixed playout tests more suitable for such purposes, since every engine will reuse the search tree, and the number of playouts should correlate more closely with the time limit than the number of visits does.

Limeztone · **#140**

jann wrote:

Limeztone wrote:

As I understand visits vs. playouts, if you clear the tree for every move made, visits and playouts become the same.

Thus a big change for a playout based test (which was affected by a random search bonus without this).

What is random about it?
Bonus compared to what?

Quote:

the same net with the same maxPlayouts could be different in strength depending on the number of threads (or executed on different hardware)

More search threads means weaker search (less freedom in which nodes to visit/expand).

Oh, thanks! Of course :-)
