Using Unsupervised Internet Based Ability Testing

Views, Issues and Risks: Towards a Guide for Effective Practice


Here is the nub of the matter addressed in this article: “…applicants were born to lie and all high stakes assessments are likely to elicit deception” (Dilchert, Ones, Viswesvaran and Deller (2006), cited in Peterson et al. (2011)).

There – I thought that would get your attention! Particularly if, like many others, you have considered using ability testing at a distance via internet mediated technologies in selection. There is clearly a lot of demand for effective tools in this area of testing. Younger (2008) notes that 100 percent of Fortune 500 employers offer an online recruitment process. Arthur et al (2010) and Tippins et al (2006) both cite the recognised benefits of remote internet testing: speed of process with reduced application-to-hire times; convenience for the respondent in taking the test at a time and place of their choosing; the ‘long reach’ of the method, able to access applicants anywhere without the traditional travel costs; and reduced cost generally, amongst other benefits. But these benefits come with potential risks, as indicated above.

This article looks at the issues and evidence surrounding this method of test use and proposes a model for effective practice when using Psytech ability tests for unsupervised, internet based remote assessment.

Terminology

To begin with, let’s define the scope of the methodologies addressed and the terminology used here. Many sources discuss the pros and cons of computer based testing (CBT). However, CBT can cover both supervised and unsupervised administration modes. Internet Testing is used here to mean delivery of the test to the respondent through the respondent’s own computer via the internet.

If a test is administered in a supervised setting where proceedings are controlled and coordinated by a qualified test administrator, it is often said to be ‘Proctored’, whereas unsupervised modes of administration are discussed in the literature as Unproctored. Thus Unproctored Internet Testing (UIT) is the practice of providing respondents with unsupervised access to tests at a distance via the internet. UIT of ability tests is the subject of this discussion.

Further, the context for assessment is relevant to the choice of administration method. Low-stakes testing is defined (Tippins, in Tippins et al (2006)) as situations where the consequences of testing affect only the respondent, with few consequences for the organisation. Conversely, Tippins defines high-stakes testing as situations where the consequences of testing affect others or institutions beyond the respondent. Recruitment and selection assessment fits within the high-stakes definition.

Within the permutations of modes and contexts for assessment, UIT in high-stakes settings gives the greatest cause for concern, as graphically highlighted by the quote from Dilchert et al in the opening paragraph. To be explicit, the concern is that if we, the trained test users, are not present in the room to control proceedings, we run the risk of what Arthur et al (2010) refer to as ‘malfeasant responding’: ‘cheating’ in some way that results in a score higher than the respondent’s true score, in classical testing terms. Simply, if people cheat in UIT, the results achieved in this mode of test use will have little or no validity.

But what is the evidence in this area of practice? Do people cheat? How do people cheat if at all? What effect does that cheating have, and if there are clear risks, how can we operate effectively in response to those risks?

Malfeasance – fear or reality?

Do people cheat if given the opportunity? The honest answer is that we cannot be certain either way. Beaty et al (2011) note that there is no published data showing what happens to test result validity when a test is taken offsite via the internet and administered to job applicants. They note that a limited number of published papers examine score differences between modes of administration (supervised vs unsupervised), but those studies have limitations. For example, Arthur et al (2010) undertook a study in which 9,426 applicants who had previously completed UIT (online verbal and numerical ability tests) as part of selection were invited to undertake a repeat assessment, although that repeat was also administered via UIT. Only 296 agreed to do so.

The hypothesis was that the first administration represented high-stakes assessment, in which the motivation for malfeasance would be high, while the second administration was for research only, so the motivation for malfeasance would be low. If ‘cheating’ had occurred in the first administration, repeat scores would be lower on second exposure (even though the second exposure was also UIT). Further, the tests used were classical ‘fixed length’ tests, so the same questions were administered on each exposure. Predictably, a general practice effect was observed: mean scores were generally higher on second exposure, which makes a clear picture of meaningful differences between exposures harder to see. However, once the practice effect was taken into account, 49.32% scored higher the second time around, 42.91% had unchanged scores, and 7.77% had lower scores. Scores that improved at time 2 were attributed to the practice effect, but the 7.77% who did rather less well at time 2 (beyond what would be expected through error of measurement) were considered evidence of malfeasance at time 1.
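The phrase “beyond what would be expected through error of measurement” can be made concrete with classical test theory: a repeat score difference is only suspicious once it exceeds the standard error of the difference between two administrations. Below is a minimal sketch of that logic; the standard deviation, reliability and cut-off values are illustrative assumptions, not parameters of any Psytech test or of the Arthur et al study.

```python
import math

def flag_score_drop(score_t1, score_t2, sd=15.0, reliability=0.90, z_crit=1.96):
    """Flag a Time 1 -> Time 2 score drop larger than measurement error
    alone would explain, using classical test theory.

    sd, reliability and z_crit are illustrative assumptions.
    """
    sem = sd * math.sqrt(1 - reliability)   # standard error of measurement
    se_diff = sem * math.sqrt(2)            # standard error of a difference score
    drop = score_t1 - score_t2
    return drop > z_crit * se_diff          # True = drop exceeds measurement error

# With these parameters, a 5-point drop is within error; a 20-point drop is not.
print(flag_score_drop(110, 105))  # False
print(flag_score_drop(120, 100))  # True
```

A respondent flagged in this way is a candidate for follow-up, not proof of cheating: the test only says the drop is unlikely to be measurement noise alone.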

Further insight might be gained from studies in related areas. Griffith and Robie, in Peterson et al (2011), note that the majority of research evidence suggests that a considerable proportion of respondents do ‘fake’ in personality assessment. Faking is a form of malfeasant responding, and if we accept that respondents will offer malfeasant responses in one class of testing, why would we expect that they wouldn’t in another?

Thus, whilst not conclusive, the evidence suggests that we should take the risk of malfeasance seriously. Arthur et al (2010) propose that malfeasance in UIT for ability could take the form of illicit aids such as calculators or dictionaries, surrogate test takers (smart friends), or pre-knowledge of test items (i.e. questions) arising from security breaches of previously administered items.

A Practical Solution

Tippins et al (2006) discuss at length the various approaches, and the relative merits of each, in response to the risks described for UIT. Additionally, a review of the current literature in this area indicates some consensus on appropriate ways to reap the benefits of UIT whilst mitigating the risks. The commonly prescribed method is to make it clear to respondents at the point of first administration that, if successful in reaching the next stage of selection, they will be required to repeat the assessment under supervised conditions.

However, this approach has limitations for classical fixed length tests. As noted, if tests are exposed in this way, items risk becoming compromised and the content of the test could become common knowledge. Further, such tests suffer from a practice effect, which makes the desired outcome of this methodology, stable repeat scores, harder to observe with clarity.

For these reasons, a different kind of test is required. Alternative test development methodologies are emerging which have benefits for UIT: Item Banked Tests and Computer Adaptive Tests based on Item Response Theory (IRT). Item Banked Tests are developed by producing a number of equivalent or ‘parallel’ alternatives for every item; the computer administration system then selects at random which of the equivalent items is presented in any administration. With the large number of permutations possible, the chances of a respondent seeing the same items in a repeat assessment are low.
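The item-banked mechanism can be sketched in a few lines. The bank structure and item names below are hypothetical illustrations, not Psytech content; the point is that with k parallel items per slot and n slots, k**n distinct forms are possible, so repeat exposure to an identical form is unlikely.

```python
import random

# Hypothetical bank: each "slot" in the test has several parallel items
# of equivalent difficulty. Names are illustrative only.
item_bank = {
    "verbal_01":    ["v01a", "v01b", "v01c"],
    "numerical_01": ["n01a", "n01b", "n01c"],
    "abstract_01":  ["a01a", "a01b", "a01c"],
}

def assemble_form(bank, rng=random):
    """Pick one parallel item at random for every slot in the bank."""
    return {slot: rng.choice(parallels) for slot, parallels in bank.items()}

form = assemble_form(item_bank)

# 3 parallels per slot across 3 slots -> 3**3 = 27 possible test forms.
possible_forms = 1
for parallels in item_bank.values():
    possible_forms *= len(parallels)
```

A production bank would of course hold far more slots and parallels, driving the number of possible forms into the millions.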

Computer Adaptive Tests, based on item response theory (IRT), draw on a large item pool for which a considerable amount is known about each item, including its difficulty (how much ability a respondent requires to most probably achieve a correct answer) and how well the item discriminates between levels of ability. The adaptive test begins by presenting the respondent with an item of known average difficulty. The item chosen next depends on whether the respondent answers the first question correctly: after a correct answer, a harder item is selected (one requiring a higher level of ability to most probably answer correctly); after an incorrect answer, an easier item follows. Again, the chances of a respondent seeing the same items in a repeat assessment are low.
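The adaptive loop can be illustrated with the simplest IRT model, the one-parameter (Rasch) model, under which the most informative next item is the unseen item whose difficulty is closest to the current ability estimate. This is a deliberately simplified sketch: the item pool, difficulties and the crude fixed-step ability update are illustrative assumptions, not how Adapt-g or IRT3 are implemented.

```python
import math

def rasch_p_correct(theta, difficulty):
    """Rasch (1PL) probability that a respondent of ability theta
    answers an item of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def next_item(pool, theta_estimate, administered):
    """Choose the unadministered item whose difficulty is closest to the
    current ability estimate (maximally informative under the 1PL model)."""
    candidates = [i for i in pool if i["id"] not in administered]
    return min(candidates, key=lambda i: abs(i["difficulty"] - theta_estimate))

# Hypothetical item pool: ids and difficulties are illustrative only.
pool = [{"id": f"q{k}", "difficulty": d}
        for k, d in enumerate([-2.0, -1.0, 0.0, 1.0, 2.0])]

theta, seen = 0.0, set()
first = next_item(pool, theta, seen)    # starts at average difficulty (0.0)
seen.add(first["id"])
theta += 0.7                            # suppose the answer was correct:
second = next_item(pool, theta, seen)   # a harder item (difficulty 1.0) follows
```

A real engine replaces the fixed `+0.7` step with a maximum-likelihood or Bayesian (EAP) ability update after every response, and stops once the ability estimate is precise enough.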

Thus both Item Banked and Computer Adaptive (IRT based) tests enable a process of repeated exposure without the same risks of the picture of repeat performance being clouded by practice effects, and security risks are much lower as only a few of the total possible items are exposed in any one administration.

In Psytech’s GeneSys Online system, there are now three tests, Adapt-g, IRT3 and CRTBi, which are based on these approaches.

Adapt-g and IRT3 follow the common General Ability Model of Verbal, Numerical and Abstract Reasoning, whilst CRTBi assesses Verbal and Numerical Critical Reasoning.

Practical Example

Consider a national graduate recruitment programme, intending to appoint 10 graduate trainees, which produces 150 viable candidates after ‘long listing’ for essential criteria. To bring all 150 candidates forward for assessment would be time consuming and expensive. Thus, to screen down to a shortlist of 40 candidates (giving a selection ratio of 4 to 1) to bring forward to a full assessment centre, an item banked or computer adaptive test is used. A Group Report is generated which rank orders all tested applicants. The top 40 can then be brought forward for the full assessment centre, where the tests can be repeated in fully supervised mode.

Theoretically, this group of 40 could contain:

- Genuine high performers
- Malfeasant high performers

The process of repeat assessment in a supervised setting would show whether those brought forward can reproduce the abilities indicated at first exposure.
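The sifting arithmetic above can be sketched as a simple rank-and-cut: score all 150 long-listed candidates, rank them as the Group Report does, and take the top 40 forward. The candidate names and randomly generated scores are illustrative only.

```python
import random

random.seed(1)
# Hypothetical scores for the 150 long-listed candidates (illustrative only).
candidates = {f"cand_{i:03d}": random.gauss(100, 15) for i in range(150)}

shortlist_size, posts = 40, 10

# Rank all tested applicants, as the Group Report does, and take the top 40.
ranked = sorted(candidates, key=candidates.get, reverse=True)
shortlist = ranked[:shortlist_size]

selection_ratio = shortlist_size / posts  # 40 candidates for 10 posts -> 4.0
```

Every member of the shortlist then repeats the test under supervision, which is what separates the genuine high performers from any malfeasant ones.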

Variations of this method are also possible. For example, adaptive or item banked assessments might be used for a first sift at a distance, providing the advantages of speed and reach described, with proven classical tests used for second-phase assessment under supervised administration. These approaches are effective in mitigating the major risks of unsupervised testing and should perhaps be viewed as best practice where UIT is used in high-stakes applications.


Arthur, W., Jr., Glaze, R. M., Villado, A. J., & Taylor, J. E. (2010). The magnitude and extent of cheating and response distortion effects on unproctored Internet-based tests of cognitive ability and personality. International Journal of Selection and Assessment, 18(1).

Dilchert, S., Ones, D. S., Viswesvaran, C., & Deller, J. (2006). Response distortion in personality measurement: Born to deceive but capable of providing valid self-assessments? Psychology Science, 24, 289–298.

Peterson, M. H., Griffith, R. L., Isaacson, J. A., O’Connell, M. S., & Mangos, P. M. (2011). Applicant faking, social desirability, and the prediction of counterproductive work behaviors. Human Performance, 24, 270–290.

Tippins, N. T., Beaty, J., Drasgow, F., Gibson, W. M., Pearlman, K., Segall, D. O., & Shepherd, W. J. (2006). Unproctored Internet testing in employment settings. Personnel Psychology, 59, 189–225.

Younger, J. (2008). Online job recruitment – trends, benefits, outcomes and implications.