Computerized Performance Testing: Forward into the Past

Abstract

This technology demonstration presents a proof-of-concept project for computerized performance-based certification testing of UNIX System Administrators. Along with a demonstration, a description is given of test construction, implementation timelines, and how implementation costs can be minimized. The advantages and disadvantages of performance-based testing are listed and guidelines are provided to determine which workplace skills are and are not suitable for performance testing. Legal issues are reviewed, especially validity requirements, as is test security. The implicit nature of knowledge testing within performance tests is discussed and demonstrated. A description of the return on investment (ROI) justification is provided. Throughout, tricks to use and pitfalls to be avoided are delineated. The use of highly integrated Electronic Performance Support Systems (EPSS), Computer-based Training (CBT), and Certification Testing for workers is described and related to Shoshana Zuboff's vision of "Informatted Organizations" as described in her book In the Age of the Smart Machine. A bibliography is available.

Overview

This section reviews the background of performance testing and relates it to concepts familiar to creators of knowledge-based tests.

What is performance testing?

Performance testing is simply the demonstration that a person can apply their knowledge and skills to perform tasks, take actions, and solve problems that are realistic indicators of applied proficiency or competence (McBeath, 1992). This paper covers the use of computers to test people on specific areas of computer-usage skills, knowledge, and abilities. This stands in contrast with the more common computer-based knowledge testing, in which both lower-order (e.g. recall, identification) and higher-order (e.g. reasoning, problem-solving) knowledge is tested.
Performance testing has been used informally since time immemorial as a part of the master-apprentice teaching cycle, but its use has been limited because of the large amount of resources needed to either observe the participant or adequately simulate the work environment. Because of the number of details that must be observed, the participant:observer ratio is almost always on the order of 1:1. This means that the expense of scoring performance tests using human observers can be high in terms of both money and time. Computer-based performance testing (CPT) relieves these constraints by having the computer itself both observe and score how a candidate performs a task involving the use of the computer. Intuitively, this makes a lot of sense for computer applications. Until recently, computers have not had sufficient power to create economically feasible performance tests, except for the most critical of jobs. Fortunately, we can now see the near future and plan to implement that oldest of tests, the performance test. In 1995, Drake Prometric (now Sylvan Prometric) noted this trend when they said: "As performance-based testing techniques develop, they will improve the fidelity of testing-the degree to which the 'indirect' process of assessment mirrors the 'direct' process" (Drake Prometric, 1995). Speaking of performance testing, a Drake spokesperson has said: "both Drake and our clients know that this is the way to go."

What performances should be tested?

How do you know which performance(s) should be tested?

The content for any test is derived from that old friend of the instructional designer: the job and task analysis. This analysis is key to having a defensible test and, if performed properly, is all the documentation required to support a performance test's use. Simply put, when doing a job is more important than knowing about a job, that job is a candidate for performance testing.
Usually a job consists of both knowing something and doing something about it, so many jobs are candidates for a combination of knowledge and performance testing.

How do you test the performance(s) you have identified?

Once a job analysis is completed, a choice must be made whether to use a knowledge test, a performance test, or a combination. Clark (1992) makes a germane point about the relationship between knowledge and performance testing when he states:
One of the reasons that both kinds of testing may be needed is that humans appear to use two kinds of knowledge, declarative and procedural. Knowledge-based tests can only test declarative knowledge; however, people rely on both declarative and procedural knowledge even for knowledge-based tasks (Clark, 1992). Fortunately, performance-based tests can measure both kinds of knowledge. Thus, for a given procedure, a newly trained employee and an experienced-but-untrained employee should both be able to succeed at a performance test, an intellectually and ethically satisfying result.

Legal (and Marketing) Issues

The law surrounding testing issues is less than clear-cut. The work cited most often in the United States is the Uniform Guidelines on Employee Selection Procedures (EEOC, 1978), which defines test fairness and adverse impact. While the standard was originally written to address discriminatory hiring practices, it has come to be viewed with "great deference" by the courts in other areas of testing (Shrock & Coscarelli, 1996). The legal issues concerning testing generally revolve around test validity, though reliability is also a concern (especially for traditional performance tests). Not surprisingly, these are the same issues that need to be considered when marketing a test. Few people can be expected to trust (or purchase) a test that is not both valid and reliable. Indeed, test use is subject to legal challenge if the test does not reflect the true content of the job in question (Douglas et al., 1989). One of the advantages of performance testing is that the test developer can guarantee that a test taker has performed a procedure once they pass the performance test for that procedure. After a test taker has passed a knowledge test, by contrast, their ability to perform a procedure can only be stated in terms of probabilities, though this is often sufficient for both legal and marketing purposes.

Validity

Because computer-based performance testing occurs in an environment that requires users to employ the specific abilities under evaluation, it is a relatively accurate way to pinpoint both a user's knowledge and the user's ability to translate that knowledge into activity. As a result, CPT can be a more valid measurement of the criteria under consideration than simple knowledge testing. (See Foster, 1994, 1995.) Validity is the extent to which an assessment fits the concept it is supposed to measure. Taken at face value, then, CPT is a highly valid measure because the test instrument (computer) and the concepts being tested (computer skills) are aligned. Beyond face validity, CPTs also withstand tests of validity at levels where knowledge-based testing may not. Two forms of validity, as defined by the EEOC, are of primary importance for performance-based tests: content validity and criterion-related validity (Shrock & Coscarelli, 1996). A third kind of validity, construct validity, relates some underlying trait such as mechanical aptitude to successful job performance. Its use is somewhat more arcane and is not covered here. For tests to meet EEOC standards, they must reflect at least one of these types of validity. Because computer-based performance assessment can easily demonstrate content and criterion-related validity, it is the logical choice for many companies.

Content Validity

Content validity reflects the likelihood that a test is an accurate measure of job content. In situations where a user's computer skills are the criteria being measured, a multiple-choice exam might have a lower content validity than a corresponding performance test because the knowledge test has no way of allowing the test taker to demonstrate computer skills. By placing users at a computer and evaluating their ability to operate it, CPT is much more likely to cover the skills involved in computer operation. CPT, therefore, can have a very high content validity.

Criterion-Related Validity

Criterion-related validity (also called concurrent validity) examines whether a test accurately distinguishes persons who possess the knowledge, skill, or ability under test from those who do not. Let us suppose that a computer system administrator must be able to "partition disk space." If all administrators able to partition disk space scored high marks on a multiple-choice exam, and those unable to partition a disk always received low scores, then the exam is likely a valid criterion-related measure. To determine the criterion-related validity of the multiple-choice exam, therefore, subjects would have to partition a disk, and the outcome of this action would have to be compared to the test scores. Well-constructed CPT tends to have high criterion-related validity because the criterion being measured (in this case, the ability to partition a disk) would be assessed by having the subject attempt disk partitioning on a computer simulation. If a subject can partition disk space in a simulation, it stands to reason that the subject would be likely to successfully repeat the behavior in a work setting at a similar machine.

Reliability

In order for any type of assessment to be valid, it must first be reliable. Reliability is the consistency with which a testing instrument produces results. Every component of the testing, from the wording on an examination to the way in which two different observers might score the exam, is an open threat to reliability. Because computer-based performance assessment occurs entirely within a computerized environment, the number of external variables that could hinder reliability is reduced significantly.

Inter-Rater Reliability

The major reliability concern in human-scored performance tests is inter-rater reliability. To achieve perfect inter-rater reliability, two independent observers must rate an individual's performance with exactly the same scores.
Even the slightest bit of variance between the two scores puts the reliability in jeopardy. Many variables can confound inter-rater reliability, including differences in socio-economic status, educational background, gender, and race. Because software will score results consistently, even if it is used on different systems or networks, computer-based performance testing eliminates most problems associated with inter-rater reliability. It conducts each simulation and evaluation in the same manner, removing errors that may result from multiple observers. Also, computers will not vary their scoring practices because of perceptions regarding a subject's race, gender, religious affiliation, or sexual orientation. This makes it less likely that CPT will be subject to accusations of discrimination or favoritism.

Test/Retest Reliability

Test/retest reliability is the extent to which an individual's test scores remain consistent over a given time period. Several variables may confound test/retest reliability: (1) the subject may remember and repeat responses from a previous test; (2) the subject may acquire new knowledge between the two testing periods; or (3) the application of the test instrument itself may bring about changes in the subject. The first two of these concerns are dealt with in the same way one would deal with a knowledge-based test. The first is remedied by having a large enough item bank. The second is remedied by retesting within a short time frame. The third concern is not problematic in a non-adaptive computer-based test; the computer will administer the test the same way each time, so concerns about differences in the way the test is presented evaporate.

Test Security

Test security has a slightly different emphasis with performance tests than with knowledge tests. Certainly test taker scores must always be kept secure, but there does not need to be the same emphasis on keeping the body of the test secret from potential test takers.
With a performance test, it is often the test taker's ability to perform a specific task that you are testing; withholding general information about the task they are expected to complete would be counterproductive, at the least. While the exact content of the test may need to be kept secret, so that test takers cannot memorize keystrokes, for example, the tasks to be tested and the criteria used for grading the test can (and likely should) be told to the potential test takers. This leads to the interesting ability to create a version of a computerized performance test that can be used both as a teaching tool and as practice for the test. While this could raise complaints of "only teaching to the test," if the test is a valid reflection of the job itself then "teaching to the test" is in reality "teaching to the job."

Cost Considerations

One of the major advantages of knowledge-based tests is that they can be constructed relatively quickly. There is also the belief that they are cheaper to construct than performance-based tests. In terms of initial test construction this can be true, depending on the size of the item banks involved. It is our experience that over the long term, performance tests require less modification and certainly a smaller item pool than knowledge-based tests.

Reducing Simulation Costs

A large component of the cost of developing a performance test is the cost of simulation development. This cost can often be sidestepped by using the actual computer application as part of the test. Usually one is concerned that some job task be completed correctly; exactly how it was completed is not relevant. In instructional design terms, one is interested in the product, not the process. When this is the case, simulation costs can disappear by measuring only the results of using the application rather than attempting to track how the application is used.
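
As a rough illustration of grading the product rather than the process, the sketch below checks only the end state of a task and ignores which commands the test taker typed. The task itself (create a directory with particular permission bits), the function name `grade_directory_task`, and the `projects` directory are hypothetical examples chosen for this sketch; they are not taken from the test described in this paper. It assumes a POSIX system.

```python
import os
import stat
import tempfile


def grade_directory_task(path: str, required_mode: int) -> bool:
    """Pass/fail check on the end product only (hypothetical task).

    Passes when `path` is a directory whose permission bits equal
    `required_mode`. How the test taker got there is not inspected.
    """
    if not os.path.isdir(path):
        return False
    actual_mode = stat.S_IMODE(os.stat(path).st_mode)
    return actual_mode == required_mode


if __name__ == "__main__":
    # Stand-in for the test environment: a throwaway sandbox directory.
    with tempfile.TemporaryDirectory() as sandbox:
        target = os.path.join(sandbox, "projects")

        # Before the task is attempted, the grader reports a failure.
        print("before:", grade_directory_task(target, 0o750))

        # Simulate a test taker completing the task by any means they like.
        os.mkdir(target)
        os.chmod(target, 0o750)
        print("after:", grade_directory_task(target, 0o750))
```

Because the grader inspects only the resulting state, the same check works whether the task was done from a shell, a script, or a graphical tool, which is what lets the real application stand in for a simulation.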

Reduced Validity Test Costs

During the construction of a performance test, one must identify the content to be covered. This content can form the basis of the validity testing for the test because the test is visibly related to the job requirements. This satisfies the need for validity testing according to the EEOC; therefore no additional (expensive) validity studies are required. For knowledge-based tests, relying only on content validity as a legal defense may be insufficient, so a follow-on validity study may need to be conducted, adding to the cost of knowledge-based tests.

Internal Return on Investment

Computer-based performance assessment can accurately represent the skill sets possessed by prospective employees. In a highly competitive and rapidly evolving field, it is vital that employees enter the workplace with a grasp of the abilities their jobs demand. In situations where none of the prospective candidates possess all the necessary skills and abilities, CPT can indicate which candidates will require the least amount of training. It can also show whether any candidates possess the skills to begin working and whether they have a strong enough grasp of the materials to pick up the remaining skills just by performing their jobs. The cost of not identifying training needs, while real, is a bit hard to measure. Easier to measure is the cost of hiring the wrong person because of a lack of adequate testing (or a lack of certification). For example, at Sun Microsystems, System Administrators are reasonably well paid; starting salaries in the $50,000 range are not uncommon. Assuming that a newly hired Administrator who cannot really do the job is discharged at the end of their 90-day probation period, the direct cost of this mistake is $12,500, one quarter of the annual salary. (In reality, indirect cost additions put this figure closer to $20,000.) Given the direct cost alone, it is easy to justify the creation of a $250,000 test just by not hiring 20 Administrators who are inadequately skilled.
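
The break-even arithmetic in the example above can be written out as a short calculation. This is a minimal sketch using only the figures quoted in the text ($50,000 salary, a 90-day probation treated as three months, a $250,000 test); the function names are illustrative and not part of the project.

```python
import math


def direct_cost_of_bad_hire(annual_salary: float, probation_months: int = 3) -> float:
    """Salary paid during probation to a hire who is then discharged."""
    return annual_salary * probation_months / 12


def break_even_hires(test_cost: float, cost_per_bad_hire: float) -> int:
    """Number of avoided bad hires needed to pay for the test (rounded up)."""
    return math.ceil(test_cost / cost_per_bad_hire)


if __name__ == "__main__":
    direct = direct_cost_of_bad_hire(50_000)
    print(direct)                             # 12500.0
    print(break_even_hires(250_000, direct))  # 20
    # Using the roughly $20,000 figure that includes indirect costs:
    print(break_even_hires(250_000, 20_000))  # 13
```

With the higher figure that includes indirect costs, the same test would pay for itself after about 13 avoided bad hires rather than 20.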

The Project

This paper includes only a brief description of the project; demonstrations of the project are more meaningful.

Design

There are two important points to make about the design of the test. First, a thorough job and task analysis is absolutely required. Second, investing time to create a complete specification for the software components pays for itself many times over. The job and task analysis is needed not only to provide the eventual content for the test, but also to serve as documentation of the test's content validity. It also forms part of the ROI justification when used internally to a company. Two months were spent writing the software specification, detailing exactly which job tasks would be tested, the techniques test takers would likely use to perform those tasks, and the criteria by which the test takers would be judged. Most of the tasks ended up being graded on a pass/fail basis, but the criteria for passing and failing had to be completely specified. A prototype multimedia engine was also developed during this period. Care was taken to keep the display and the content separated. This would allow the different tasks to be easily inserted into the multimedia framework. This separation of content was also designed to facilitate translation into other languages.

Implementation

Once the design had been completed and a functional prototype produced, work began on completing the test in the shortest time possible.

Code-Raising

This was perhaps the most distinctive and interesting aspect of the test development process. Originally there were twelve weeks scheduled for programming the tasks. However, when it came time for the programming we were eight weeks behind schedule. One of the team members invented the concept of a "Code-Raising" after the old American frontier custom of the "Barn Raising," in which neighbors would come over and help a newcomer build a barn in one day.
Once the software specification was written, seven programmers were gathered together in one large development lab. These people were carefully selected to be content experts as well as programmers. This group, working from the detailed software specification, produced 85% of the alpha software code in one week, making up the eight weeks we were behind schedule. Once the alpha code was complete, translation into French, German, Spanish, and British English began.

Usability/Formative Testing

Again in an effort to minimize development time, we conducted a usability test of the multimedia interface and the formative test of the job task tests simultaneously. Interestingly, despite the fact that some of the tasks weren't working properly, the system administrators immediately recognized the usefulness of performance testing and saw great benefit to the type of computerized test we had created. One test taker commented, "Well, it failed me even though I did the task correctly, but that's just a software bug. The basic design of the test is sound. It does what it's supposed to do. I like it!" This overwhelming face validity would become an important marketing tool, though we didn't recognize it at the time. Once the identified problems had been fixed, we began international testing. Again in the interests of time, we conducted usability, formative, and translation testing simultaneously. We tested in London, Paris, Grenoble, Munich, and Mexico City. This testing not only led to an improved design and fixes to the software, but also ensured that the translations would be accepted by system administrators who were native speakers of each of the five languages. The main findings from international testing were:

The Future

While CPT is valuable in and of itself, it can operate synergistically with other technologies to increase the productivity of organizations. In particular, when combined with computer-based training (CBT), computer-based performance testing can create more meaningful training than has been heretofore possible. Additionally, CPT can be combined with other electronic performance support systems (EPSS) to help people more effectively use information that may be changing on a frequent basis. The combinations of CBT and CPT or EPSS and CPT can create a just-in-time (JIT) learning environment that provides people with the information that they need exactly when they need it. Using these tools, one can put less or more emphasis on the long-term retention of information as the situation warrants. With JIT training and testing, employees need not modify their schedules to accommodate those of an outside testing agency. Companies using JIT training can also focus their employees' resources on the jobs for which they were hired and not on the task of administering test materials or observing and evaluating the people being tested. As a result of these practices, JIT training allows companies the most efficient and effective allocation of employees, time, and money. This is exactly the kind of instruction that organizations need now and in the future. Shoshana Zuboff, in her book In the Age of the Smart Machine, points out that:
The combination of computer-based training and computer-based performance testing can provide the technological learning infrastructure that an informatted organization needs to survive.

Summary

Computer-based performance testing takes currently available technology and puts it to work for companies requiring accurate and cost-efficient evaluations of their current and prospective employees. By using computers to test computer skills, CPT ensures a valid and reliable method for determining both knowledge and the ability to apply this knowledge in a work setting. CPT also provides an objective setting in which all users' skills are tested by the same criteria, meeting the standards set by the EEOC.

Bibliography