There are countless events organised across the world whose entries need to be evaluated subjectively, ranging in scale and style from a course project to an online meme contest. Such contests are inherently difficult to judge, since different people view the results differently. In the status quo, organisers rely on a set of official judges scoring each project on a scale (say 1 to 100), and then ranking the projects by average score.
There are two key issues with this approach: humans are not good at assigning consistent absolute scores, so a judge's scale can drift over the course of an event, and scores from different judges are not calibrated against one another, so averaging them mixes incompatible measurements.
However, one thing humans are good at is comparisons: when shown two "objects" simultaneously, people can state a preference consistently. Keeping in mind the efficacy of pairwise judgements as an evaluation metric, it is useful to build software that automates both providing pairs for comparison and resolving those comparisons into a final ranking. Most of our project focuses on the latter, as it has an interesting reduction to one of Karp's 21 NP-complete problems. The former seems harder, since it requires analysing what type of data a given algorithm performs well on; for now, we have mostly avoided this consideration and left it for future work.
It is important to note that a pairwise-comparison approach works best when individual comparisons take little time. This is not the case when comparing large reports (say 50+ pages) or presentations, since the number of comparisons judges could complete would be low. It is, however, useful in quicker-to-evaluate and more subjective settings such as digital art contests, beauty pageants (as much as we condemn them) and meme contests, where a judge can be shown two objects and provide a comparison in seconds. As an aside, the same algorithm can be used as an alternative way of ranking sports teams: the total-points ranking (as used in the IPL, EPL, and many other leagues) is actually a greedy heuristic for MFAS, which we describe later. What if we could use the more general MFAS-based ranking instead? This would be especially applicable to national-team rankings in cricket or football, since those teams play each other in a sporadic, non-league format. In theory, our MFAS-based approach is a direct competitor to the more popular Elo rating.
Note that we drew the inspiration for this problem from Gavel, an expo judging system designed for HackMIT and now used by many other hackathons. Gavel uses pairwise comparisons; for the ranking algorithm, it initially used a probabilistic approach based on Thurstone's statistical model of judgements and Tikhonov regularization, and eventually shifted to the more sophisticated (and also probabilistic) CrowdBT algorithm, which not only produces a ranking but also provides a framework for collecting the data. CrowdBT was originally meant for crowdsourcing and was adapted for judging by Gavel. We have instead modelled the problem as a deterministic one.
So, given pairwise comparisons (judging data), how can we generate a final ranking? We can model the problem as a graph where each object is a node, and a directed edge from object $u$ to $v$ with weight $w$ represents that $v$ was preferred over $u$ by $w$ more judges than preferred $u$ over $v$. Now, we want to find an ordering (ranking) of all nodes that minimizes the total weight of 'disagreements' in the ranking, where a disagreement occurs when $u$ is ranked above $v$ even though judges on net preferred $v$. Notice that if our graph is acyclic, ranking the nodes in reverse topological order gives 0 disagreements, and there are simple $O(|V| + |E|)$ algorithms for topological sorting using a queue (Kahn's algorithm) or DFS. Moreover, it is impossible to find an ordering with 0 disagreements if there is a directed cycle in the graph. This makes finding a ranking with minimum disagreements equivalent to the following problem:
Given a directed graph, remove the minimum number of edges such that the resulting graph is acyclic.
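As a warm-up, the acyclic case described above can be handled directly. Here is a minimal Python sketch (the function and variable names are ours and purely illustrative, not taken from the project code): it builds the net-preference graph from raw (loser, winner) judgements and, when that graph happens to be acyclic, returns a zero-disagreement ranking by reversing a topological order computed with Kahn's queue-based algorithm.

```python
from collections import defaultdict, deque

def rank_if_acyclic(objects, comparisons):
    """Return a zero-disagreement ranking (best object first) if the
    net-preference graph is acyclic, otherwise None.

    `comparisons` is a list of (loser, winner) pairs, one per judgement,
    and every compared object is assumed to appear in `objects`.
    """
    # Net preference: net[(u, v)] > 0 means v beat u more often than u beat v.
    net = defaultdict(int)
    for loser, winner in comparisons:
        net[(loser, winner)] += 1
        net[(winner, loser)] -= 1

    # Keep each pair's edge only in its net direction: u -> v means v is preferred.
    adj = defaultdict(list)
    indegree = {x: 0 for x in objects}
    for (u, v), w in net.items():
        if w > 0:
            adj[u].append(v)
            indegree[v] += 1

    # Kahn's algorithm: repeatedly peel off nodes with no incoming edges.
    queue = deque(x for x in objects if indegree[x] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)

    if len(order) != len(objects):  # some nodes were never freed: a cycle remains
        return None
    return order[::-1]              # losers come first in topological order, so reverse
```

When a directed cycle is present, Kahn's algorithm processes fewer than $|V|$ nodes and the sketch returns `None`, leaving us with exactly the edge-removal problem stated above.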
That problem is the unweighted version of the minimum feedback arc set problem, which is known to be NP-complete. Our formulation additionally attaches a weight to each edge, giving the classic Weighted Minimum Feedback Arc Set (MFAS) problem, which is naturally also NP-complete. An important thing to note, though, is that since not many judges will provide judgements on the same pair of objects, the edge weights will be small.
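For very small instances the weighted objective can be pinned down by exhaustive search. The sketch below (again with illustrative names of our own) scores an ordering by the total weight of comparisons it contradicts and returns the best of all orderings; it is exponential in the number of objects and only meant to make the objective concrete.

```python
from itertools import permutations

def disagreement_weight(ranking, edges):
    """Total weight of comparisons that `ranking` (best object first) contradicts.

    `edges` maps (u, v) -> w with the convention used above: v was net-preferred
    over u by w judges, so ranking u above v violates weight w.
    """
    position = {obj: i for i, obj in enumerate(ranking)}
    return sum(w for (u, v), w in edges.items() if position[u] < position[v])

def brute_force_ranking(objects, edges):
    """Exact minimum-disagreement ranking by trying every ordering (O(n!))."""
    return min(permutations(objects), key=lambda r: disagreement_weight(r, edges))
```

Minimizing this quantity is exactly the weighted MFAS objective: the violated edges are precisely the ones that would have to be removed for the remaining graph to be consistent with the ranking.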
We can prove MFAS is NP-hard by reducing Vertex Cover to the unweighted feedback arc set problem (FAS). Since Vertex Cover is NP-complete, every problem in NP can be reduced to Vertex Cover, which can in turn be reduced to FAS, making FAS NP-hard. FAS in turn reduces to MFAS (it is just the unit-weight special case), so MFAS is NP-hard as well.
Formally, the feedback arc set problem, given a digraph $G$ and an integer $k$, asks whether the removal of at most $k$ edges can make $G$ acyclic. Notice that FAS is in NP: given a subset of edges $E'$ proposed for removal, we can check whether $G(V, E \setminus E')$ is acyclic using a topological sort, which takes $O(|V| + |E|)$ (linear) time.
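For the NP-hardness direction, one standard Vertex Cover $\to$ FAS construction splits every vertex into an 'in' copy and an 'out' copy. The sketch below shows this textbook gadget; the helper name is ours, and this is not necessarily the exact reduction we use in our proof.

```python
def vertex_cover_to_fas(vertices, undirected_edges):
    """Build the arcs of a digraph whose minimum feedback arc set has the
    same size as a minimum vertex cover of the input undirected graph.

    Standard textbook gadget (illustrative, not necessarily our exact proof):
    each vertex v gets an internal arc (v, 'in') -> (v, 'out'), and each
    undirected edge {u, v} gets two cross arcs between the copies.
    """
    arcs = []
    for v in vertices:
        arcs.append(((v, 'in'), (v, 'out')))       # internal arc for v
    for u, v in undirected_edges:
        arcs.append(((u, 'out'), (v, 'in')))       # cross arcs: together with the
        arcs.append(((v, 'out'), (u, 'in')))       # internal arcs they form a 4-cycle
    return arcs
```

Each undirected edge $\{u, v\}$ yields the directed 4-cycle $(u,\mathrm{in}) \to (u,\mathrm{out}) \to (v,\mathrm{in}) \to (v,\mathrm{out}) \to (u,\mathrm{in})$, and one can check that the original graph has a vertex cover of size at most $k$ if and only if this digraph has a feedback arc set of size at most $k$.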