Using Kendall's tau to compare recommendations

Kendall’s tau is a metric used to compare the order of two lists. It is only defined if both
lists contain exactly the same elements. When comparing only a part of two lists, for example the top-5 elements
generated by two recommenders, it cannot be used as these are unlikely to contain only common items. With a few
tricks one can however leverage the Kendall tau to compare such lists.

Intro

The Kendall tau is a metric that can be used to compare the order of two ranks. It takes two ranks that contain the same elements and calculates the correlation between them. A correlation of $+1$ means the ranks are equal, while a correlation of $-1$ means the ranks are exactly eachother’s reverse. If two ranks are independent or randomly shuffled the correlation will be zero on average.

The Kendall tau is undefined when the two ranks do not contain exactly the same elements. So when one wants to compare
ranks which do not necessarily contain the same elements one needs to look elsewhere. The same goes for comparing only
parts (for example 10 highest entries) of ranks, as these will not likely contain the same elements.

Below I’ll explain how you can use a few tricks to make Kendall’s tau fit for comparing ranks which do not necessarily
contain the same items. At the end I’ve provided with a python implementation that should get anyone up and
running in minutes.

Use case

So why would you want to compare two ranks that do not contain the same elements?

You may be interested in a part of a rank, for example the top-10 items. An example of such a case is
the output of a recommender: the recommender will provide you with a ranked list of all items it can possibly recommend,
but you will only show the best few to the user.

When comparing the output of two recommenders it does not make sense to compare items that will not be displayed. If
recommenders $A$ and $B$ return the same top-10 items they are essentially equal, regardless of whether the item
in position 100 of $A$ is also in position 100 of $B$ . However, we can not use Kendall’s
tau to directly compare the first 10 items of $A$ with the first 10 items of $B$ , as they are likely to not contain exactly
the same elements.

Obviously, the best way to compare two recommenders is an A/B test. But performing an A/B test requires time and resources
you may not always have. In that case you can also directly compare the output of the recommenders.

For example, I have used this method to evaluate a change in a recommender which traded a performance gain for a
reduction in accuracy. A direct comparison of the results can provide you with valuable information on the effect of
the reduced accuracy. If the results are similar you can push the change to production, if not you may want
to stop wasting your time and forget about the change.

Definition

The most used definition of the Kendall tau between two ranks $a$ and $b$ is:

 $\tau \equiv \frac{n_c - n_d}{\sqrt{(n_0-n_a)(n_0-n_b)}},$

where $n_c$ and $n_d$ are the number of concordant pairs and the number of discordant pairs respectively,
$n_0$ is defined as

 $n_0 \equiv \frac{n(n-1)}{2},$

where $n$ is the length of the ranks, and $n_a$ and $n_b$ account for ties in the ranks, and are defined as

 $\begin{aligned} n_a &\equiv \sum_i \frac{t^a_i (t^a_i-1)}{2},\\\\ n_b &\equiv \sum_j \frac{t^b_j (t^b_j-1)}{2}, \end{aligned}$

where $t^a_i$ and $t^b_j$ are the number of (tied) items that share the $i^\text{th}$ place in rank $a$ ,
and the number of (tied) items that share the $j^\text{th}$ place in rank $b$ respectively.

Between the ranks $a$ and $b$ , a pair of items $x$ and $y$ is

concordant if

 $(a_x > a_y \wedge b_x > b_y) | (a_x < a_y \wedge b_x < b_y)$

discordant
(a_x > a_y \wedge b_x < b_y) | (a_x < a_y \wedge b_x > b_y)
neither concordant nor discordant in the case of ties,
```
 $a_x = a_y \wedge b_x = b_y$ 
```

When there are no ties, this reduces to the simpler form

 $\tau \equiv \frac{n_c - n_d}{n_0},$

as both $n_a$ and $n_b$ will be zero.

An example

Let’s have a look at these two ranks:

rank_a = {'apple': 0, 'banana': 2, 'kiwi': 3, 'pear': 1}
rank_b = {'apple': 2, 'banana': 1, 'kiwi': 3, 'pear': 0}

Here $a\text{apple} < a\text{pear}$ , while $b\text{apple} > b\text{pear}$ , so this pair is discordant.
Also, $a\text{pear} < a\text{banana}$ and $b\text{pear} < b\text{banana}$ , so that pair is condordant.
In total, this example contains four concordant pairs and two discordant pairs. Since $n=4$ , this means that

 $\tau = \frac{4-2}{4(4-1)/2} = \frac{1}{3},$

which means the two ranks are slightly correlated.

Since there are no ties in the above example, we can directly translate it to two lists:

a = ['apple', 'pear', 'banana', 'kiwi']
b = ['pear', 'banana', 'apple', 'kiwi']

In list-form, a concordant pair has the same order in both lists, while the order of a discordant pair is swapped between
the lists. Two elements cannot occupy the same spot, so we can not have ties.

Dealing with mismatches

The definition of the Kendall tau cannot handle items which occur in only one of the lists. But when we are
comparing the top-5 results from recommender $a$ with the top-5 results from recommender $b$ we can expect
to have mismatches: it is unlikely that the algorithms will produce the same results. Nevertheless, we do not
want to compare the entire ranks produced by the recommenders, as we are not at all interested in the bottom of the
rank.

Ignoring mismatches

The easiest way to deal with mismatches is to ignore them. By simply removing all elements that do not occur in both
lists we reduce the problem to one that we can solve. For example,

a = ['apple', 'pear', 'banana', 'kiwi', 'pineapple']
b = ['pear', 'orange', 'banana', 'apple', 'kiwi']

would be reduced to the example above, where $\tau = \frac{1}{3}$ . At first glance, that seems acceptable. However, in here:

a = ['apple', 'pear', 'banana', 'kiwi', 'pineapple']
b = ['apple', 'pear', 'banana', 'kiwi', 'orange']

lists a and b would be reduced to identical lists, yielding $\tau = 1$ even though the original lists are not the same.

The problems become worse when there are more mismatches in the list:

a = ['pineapple', 'lemon', 'apple', 'kiwi', 'grape']
b = ['apple', 'pear', 'banana', 'kiwi', 'orange']

would also evaluate to $\tau = 1$ (since apple and kiwi are concordant), while I would argue the lists are far from
equal. In the extreme case that there is only one match,

a = ['pineapple', 'lemon', 'apple', 'kiwi', 'grape']
b = ['apple', 'pear', 'banana', 'plum', 'orange']

the Kendall tau is undefined, as the denominator $n(n-1)/2 = 0$ for $n=1$ . So clearly, ignoring mismatches is not the solution to our problem.

Appending mismatches

Instead of ignoring the mismatches, we can also append them to the bottom of the other list. In a way this makes sense, since all results are in both lists, just not necessarily in the top-5. Since we do not have any information on where
the results are in the list (below the top-5) we should treat all results below the top-5 as equal. For example:

a = ['pineapple', 'apple', 'pear', 'kiwi', 'grape']
b = ['apple', 'pear', 'banana', 'kiwi', 'orange']

would then become

rank_a = {'apple': 1, 'banana': 5, 'grape': 4, 'orange': 5, 'pear': 2, 'pineapple': 0, 'kiwi': 3}
rank_b = {'apple': 0, 'banana': 2, 'grape': 5, 'orange': 4, 'pear': 1, 'pineapple': 5, 'kiwi': 3}

where banana and orange end up in a tied 5th place in rank a, and pineapple and grape share a tied 5th place
in rank b. This yields $\tau \approx 0.15$ , which is close to what I would expect.

If we do this for other examples we get some nice results:

a = ['apple', 'pear', 'banana', 'kiwi', 'grape']
appended_tau(a, a) -> 1

# replace last element
b = ['apple', 'pear', 'banana', 'kiwi', 'orange']
appended_tau(a, b) -> 0.87

# replace first element
c = ['orange', 'pear', 'banana', 'kiwi', 'grape']
appended_tau(a, c) -> -0.20

# replace two elements
d = ['orange', 'pear', 'pineapple', 'kiwi', 'grape']
appended_tau(a, d) -> -0.45

# replace all elements
e = ['orange', 'tomato', 'pineapple', 'lemon', 'plum']
appended_tau(a, e) -> -0.71

# invert
f = ['grape', 'kiwi', 'banana', 'pear', 'apple']
appended_tau(a, f) -> -1

The correlation is reduced when an element is replaced, and replacing the first element has significantly
more effect than replacing the last. That is exactly as we expect. However, replacing all elements
results in a higher correlations than inverting the list. I would say this is not what we expect: I would say an inverted top-5 is closer to the original than one that is completely different.

So why is the correlation of a and e larger than that of a and f? It is clear that between a and f all
pairs of words are discordant. If we have a closer look at the resulting ranks when we compare a and e:

rank_a = {'apple': 0, 'banana': 2, 'grape': 4, 'kiwi': 3, 'lemon': 5,
          'orange': 5, 'pear': 1, 'pineapple': 5, 'plum': 5, 'tomato': 5}
rank_e = {'apple': 5, 'banana': 5, 'grape': 5, 'kiwi': 5, 'lemon': 3,
          'orange': 0, 'pear': 5, 'pineapple': 2, 'plum': 4, 'tomato': 1}

we see that also here there are no concordant pairs. Not all pairs are discordant, however, as all pairs of which both
items occur in the same list are tied in the rank of the other. For example apple and banana are tied in rank e,
and are therefore not discordant. So the difference between the two correlations comes from the difference in number
of ties, or the difference in length of the ranks.

Extending the ranks

We can fix the inbalance in number of ties by adding a number of dummy items to the ranks, such that all ranks have the
same length. The ranks we obtain when we compare two lists can never be longer than twice the length of a list,
which is the case then the lists have no common elements. Therefore, if we extend all ranks to twice the length of a list,
every rank will have the same length.

In the example of a and e above the lists share no common items and the ranks are twice the length of the lists.
This means we will not add any dummy items, and the correlation will remain $\tau = -0.71$ . In the case of a and f,
however, all items are common, and therefore the ranks have the same length as the lists. We will thus extend the ranks
with dummy items (let’s take grains):

a = ['apple', 'pear', 'banana', 'kiwi', 'grape']
f = ['grape', 'kiwi', 'banana', 'pear', 'apple']

rank_a = {'apple': 0, 'banana': 2, 'grape': 4, 'kiwi': 3,  'pear': 1,
          'wheat': 5, 'barley': 5, 'rice': 5, 'corn': 5, 'rye': 5}
rank_f = {'apple': 4, 'banana': 2, 'grape': 0, 'kiwi': 1,  'pear': 3,
          'wheat': 5, 'barley': 5, 'rice': 5, 'corn': 5, 'rye': 5}

which yields a correlation of $\tau \approx 0.45$ .

Now isn’t that a rather high correlation between a list and its inverse? For just these lists that is a
rather high correlation indeed, but remember that we are comparing the top-5 highest ranked items of (much) longer
lists. The fact that the lists have the same items in their five highest entries makes them fairly similar in my
opinion.

Now let’s have a look at a few examples:

a = ['apple', 'pear', 'banana', 'kiwi', 'grape']
extended_tau(a, a) -> 1

# replace the last element
b = ['apple', 'pear', 'banana', 'kiwi', 'lemon']
extended_tau(a, b) -> 0.83

# invert
c = ['grape', 'kiwi', 'banana', 'pear', 'apple']
extended_tau(a, c) -> 0.43

# replace the first element
d = ['tomato', 'pear', 'banana', 'kiwi', 'grape']
extended_tau(a, d) -> 0.37

# replace three elements and add three new
e = ['lemon', 'tomato', 'apple', 'pineapple', 'grape']
extended_tau(a, e) -> -0.23

# replace all elements
f = ['orange', 'tomato', 'pineapple', 'lemon', 'plum']
extended_tau(a, f) -> -0.71

These results look right: every move to scramble the list even more results in a lower correlation. Replacing the first
element has more impact than replacing the last, and inverting the list results in a far greater correlation than
replacing all elements. The only problem left is the scale: we expect the correlation to scale in the range $[-1, +1]$ , but
in this case the minimum value lies around $-0.71$ . This minimum value depends on the length of the lists we compare.

Scaling the result

We can calculate the minimum value as a function of the length of our lists. The minimum correlation
is achieved when the two lists have no overlapping elements. Going back to the definition,

 $\tau \equiv \frac{n_c - n_d}{\sqrt{(n_0-n_a)(n_0-n_b)}},$

this means that that $n_c = 0$ , and $n = 2l$ , where $l$ is the length of the lists. In each of the ranks there will be
$l$ elements that are tied in the last position, and there will be no other ties. This means that $n_a = n_b = l(l-1)/2$ .
Also, $n_d = n_0 - n_a - n_b$ , as all pairs, except those that are tied, are discordant. Thus,

 $\begin{aligned} \tau_{\min}(l) &= -\frac{n_0 - 2n_a}{n_0 - n_b}\\\\ &= -\frac{n(n-1) - 2l(l-1)}{n(n-1) - l(l-1)}\\\\ &= -\frac{2l(2l-1) - 2l(l-1)}{2l(2l-1) - l(l-1)}. \end{aligned}$

When $l=5$ , this results in $\tau_{\min} \approx -0.71$ .

Using the result above, we can scale the extended tau such that it covers an interval of $[-1, +1]$ :

 $\tau_s \equiv 2\frac{\tau-\tau_{\min}(l)}{1-\tau_{\min}(l)} - 1.$

Summary

The Kendall tau is undefined for lists that do not contain the same elements, which prevents us from using it
to compare parts of ranks.
We can circumvent this limitation by appending all missing elements to the rank of either list in a tied last
position, behind all elements present in the list. A variable number of tied elements skews the resulting correlations,
which we can prevent by adding tied dummy items to both ranks. In the end we have to scale the result to ensure
it falls in the interval $[-1, +1]$ .

It is important to note that the resulting ‘correlation’, although a measure of the similarity between two ranks,
does not have all properties anymore you would expect of a Kendall tau, or any correlation in general. Obviously, the
result of comparing a list with its inverse is no longer $-1$ , and the expected result between randomly scrambled lists
is no longer . Nevertheless the measure can be useful when comparing the top elements of ranks.

A python implementation is shown below.

import pandas as pd
import scipy.stats as stats

def extended_tau(list_a, list_b):
    """ Calculate the extended Kendall tau from two lists. """
    ranks = join_ranks(create_rank(list_a), create_rank(list_b)).fillna(len(list_a))
    dummy_df = pd.DataFrame([{'rank_a': len(list_a), 'rank_b': len(list_b)} for i in range(len(list_a)*2-len(ranks))])
    total_df = ranks.append(dummy_df)
    return scale_tau(len(list_a), stats.kendalltau(total_df['rank_a'], total_df['rank_b'])[0])

def scale_tau(length, value):
    """ Scale an extended tau correlation such that it falls in [-1, +1]. """
    n_0 = 2*length*(2*length-1)
    n_a = length*(length-1)
    n_d = n_0 - n_a
    min_tau = (2.*n_a - n_0) / (n_d)
    return 2*(value-min_tau)/(1-min_tau) - 1

def create_rank(a):
    """ Convert an ordered list to a DataFrame with ranks. """
    return pd.DataFrame(
                  zip(a, range(len(a))),
                  columns=['key', 'rank'])\
             .set_index('key')

def join_ranks(rank_a, rank_b):
    """ Join two rank DataFrames. """
    return rank_a.join(rank_b, lsuffix='_a', rsuffix='_b', how='outer')

Using Kendall’s tau to compare recommendations