Nearly all education systems, and many human resources departments, administer tests. They aim to assess fairly the abilities of the people they wish to train, promote, or admit. But the traditional way of administering tests suffers from a deep, systemic flaw: the result is sensitive to which questions happen to be asked.
A test cannot practically ask every possible question in the field being tested. So a typical test poses a small, fixed number of questions and scores people on the number of correct answers. This can make results inaccurate and may introduce bias.
How can we improve grading accuracy while maintaining (or reducing) test length? The field of “Item Response Theory” was born to answer this question.
Item response theory (IRT) grades question difficulty from easy to hard. It then grades a test taker's ability based not on the number of questions answered correctly, but on the difficulty of those questions. IRT lets us create fixed and adaptive tests that are less tedious, more accurate, and less biased, all at the same time.
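To make the idea concrete, here is one common way to tie ability and difficulty together. The text above does not commit to a particular IRT model, so take this as an assumption for illustration: the one-parameter logistic (Rasch) model, with the symbols θ for a person's ability and b for a question's difficulty chosen here for convenience.

```latex
P(\text{correct} \mid \theta_i, b_j) = \frac{1}{1 + e^{-(\theta_i - b_j)}}
```

Under this assumed model, a person whose ability equals a question's difficulty has an even chance of answering it correctly, and the chance rises as ability exceeds difficulty.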
But there is a catch.
Let's say we have a pool of people, each of whom has answered some subset of a pool of questions. Further, let's make no assumptions about how questions might have been chosen from the pool.
Now, if we knew the difficulty of each question, we could grade the ability of each test taker: a test taker who correctly answers more difficult questions has higher ability. Conversely, if we knew the ability of each test taker, we could calculate the difficulty of each question: a question that high-ability people struggle to answer is probably difficult.
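Each of these two directions can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: it assumes a one-parameter logistic (Rasch) model, which the text does not specify, and all function names, the search bounds, and the sample data are hypothetical.

```python
import math

def p_correct(ability, difficulty):
    """Assumed Rasch model: chance that a person answers a question correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def solve_increasing(f, lo=-6.0, hi=6.0, iters=60):
    """Bisection for the zero of an increasing function f on [lo, hi].

    If f has no zero in the interval (e.g. every answer was correct),
    the result settles at one of the bounds, which are arbitrary choices.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def estimate_ability(answers, difficulty):
    """answers: {question: 1 or 0} for one person; difficulties are known."""
    observed = sum(answers.values())
    # Maximum likelihood under the assumed model: pick the ability at which
    # the expected number of correct answers equals the observed number.
    return solve_increasing(
        lambda a: sum(p_correct(a, difficulty[q]) for q in answers) - observed)

def estimate_difficulty(answers, ability):
    """answers: {person: 1 or 0} for one question; abilities are known."""
    observed = sum(answers.values())
    # Expected correct answers fall as difficulty rises, so this is increasing.
    return solve_increasing(
        lambda d: observed - sum(p_correct(ability[p], d) for p in answers))

# Hypothetical data: known difficulties, two people, different question subsets.
difficulty = {"q1": -1.0, "q2": 0.0, "q3": 2.0}
ann = estimate_ability({"q1": 1, "q2": 0}, difficulty)
bob = estimate_ability({"q2": 0, "q3": 1}, difficulty)

# The other direction: known abilities, one new question answered by both.
q4 = estimate_difficulty({"ann": 0, "bob": 1}, {"ann": ann, "bob": bob})
```

In this toy data Ann and Bob each answer one of two questions correctly, yet Bob's estimate comes out higher (about 1.0 against about -0.5) because his questions were harder. Both functions need the other quantity as an input, so neither can be run on its own from raw answers.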
But what if we know nothing about either test taker ability or question difficulty to begin with?
We devised a graph-based bootstrapping algorithm to solve this chicken-and-egg problem. It transforms bare data about questions and answers into information about both question difficulty and person ability. It can analyse past test datasets to find question difficulties, so that future tests can be administered more accurately, which makes it useful in many educational and HR scenarios.