Performance appraisal is the systematic evaluation of employee job performance against defined criteria. It serves both administrative purposes (pay, promotion, termination decisions) and developmental purposes (feedback, goal-setting, identifying training needs). The central measurement challenge is that job performance is a latent construct that must be inferred from observable indicators — and human raters introduce systematic biases (halo, leniency, central tendency, recency) that contaminate those measurements. Decades of research have focused on designing rating formats, training raters, and structuring appraisal processes to minimize these biases while maximizing the accuracy and utility of performance information.
Performance appraisal is one of the most widely practiced and most widely disliked activities in organizations. Surveys consistently show that managers, employees, and HR professionals are all dissatisfied with the process, yet organizations continue to invest in it because the alternative (no systematic performance evaluation at all) is worse. Understanding why appraisal is so difficult requires appreciating that it is fundamentally a measurement problem complicated by social and motivational forces.
The measurement challenge starts with the criterion. Job performance is not a single, directly observable quantity. It is a latent construct inferred from indicators — supervisor ratings, objective output metrics, customer satisfaction scores, peer evaluations. Each indicator captures some aspect of performance while missing others, and each has its own sources of error. Supervisor ratings are the most common criterion measure, but they are susceptible to systematic biases: halo error (overall impression bleeding across dimensions), leniency (rating everyone high to avoid conflict), central tendency (clustering ratings around the midpoint), and recency (overweighting recent events at the expense of the full rating period).
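As a rough illustration of how three of these biases can show up in rating data, the sketch below computes simple descriptive indices for one rater's scores. The function name, the 1-to-5 scale, and the sample data are assumptions made for this example, not part of any standard instrument. The indices are only warning signs: a high mean may reflect a genuinely strong team rather than leniency, and recency cannot be detected without knowing when the rated events occurred.

```python
from statistics import mean, pstdev


def rater_bias_indices(ratings, scale_min=1, scale_max=5):
    """Crude bias indicators for one rater (hypothetical helper).

    ratings: list of rows, one row per ratee, one score per dimension.
    """
    midpoint = (scale_min + scale_max) / 2
    all_scores = [score for row in ratings for score in row]
    return {
        # Leniency (positive) or severity (negative): distance of the
        # rater's overall mean from the scale midpoint.
        "elevation": mean(all_scores) - midpoint,
        # Range restriction: a small spread across all ratings. Combined
        # with an elevation near zero, it suggests central tendency.
        "overall_spread": pstdev(all_scores),
        # Possible halo: little variation across dimensions within each
        # ratee, averaged over ratees.
        "within_ratee_spread": mean([pstdev(row) for row in ratings]),
    }


# Illustrative data: three ratees scored on three dimensions.
ratings = [
    [5, 5, 5],
    [4, 4, 4],
    [5, 5, 5],
]
indices = rater_bias_indices(ratings)
print(indices)
```

In this made-up example every ratee receives identical scores on all dimensions (within-ratee spread of zero) and the mean sits well above the midpoint, the pattern one would expect if halo and leniency were both operating.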
Rating scale format has been a major research focus. Graphic rating scales (simple numbered scales with trait labels like "communication" or "teamwork") are easy to create but suffer from ambiguity, because raters must decide for themselves what each scale point means. Behaviorally anchored rating scales (BARS) were developed to address this by providing specific behavioral examples at each scale point, derived through a systematic process involving subject matter experts. Behavioral observation scales (BOS) take a different approach, asking raters to indicate the frequency of specific behaviors. Research shows that these behaviorally based formats modestly improve rating quality, though they do not eliminate bias entirely, because the fundamental constraint is not the scale but the human cognitive process of observation, encoding, storage, retrieval, and judgment.
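To make the BOS idea concrete, here is a minimal scoring sketch. The frequency labels, the point values, the behavior statements, and the use of a simple average are all assumptions chosen for illustration; actual instruments define their own items and scoring rules.

```python
# Hypothetical five-point frequency scale for a BOS-style instrument.
FREQUENCY_POINTS = {
    "almost never": 1,
    "seldom": 2,
    "sometimes": 3,
    "often": 4,
    "almost always": 5,
}


def bos_dimension_score(responses):
    """Average the frequency points across the behaviors of one dimension.

    responses: dict mapping a behavior statement to a frequency label.
    """
    points = [FREQUENCY_POINTS[label] for label in responses.values()]
    return sum(points) / len(points)


# Illustrative behaviors for a "communication" dimension.
communication = {
    "Summarizes decisions at the end of meetings": "often",
    "Responds to colleague questions within a day": "almost always",
    "Asks clarifying questions before starting work": "sometimes",
}
score = bos_dimension_score(communication)
print(score)  # (4 + 5 + 3) / 3 = 4.0
```

The point of the format is visible in the data structure: the rater reports how often each concrete behavior occurred rather than placing the ratee on an abstract trait scale.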
Rater training represents another intervention strategy. Frame-of-reference (FOR) training is generally regarded as the most effective approach: raters study the performance dimensions, review behavioral examples at each level, practice rating standardized ratee vignettes, and receive feedback on their accuracy. FOR training works because it gives raters a shared mental model of what each performance level looks like, reducing the idiosyncratic interpretation that produces disagreement. By contrast, rater error training (teaching raters about halo, leniency, and similar errors) can backfire by making raters overly self-conscious without giving them a better framework for accurate rating.
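The accuracy-feedback step of FOR training can be sketched as a comparison between a trainee's practice ratings and expert target scores for the same vignettes. The data layout, the function name, and the choice of mean signed and mean absolute error as feedback statistics are assumptions for this example; training programs may use other accuracy measures.

```python
from statistics import mean


def accuracy_feedback(trainee, targets):
    """Per-dimension error of trainee ratings against expert targets.

    Both arguments map vignette id -> {dimension: score} (assumed layout).
    """
    errors = {}
    for vignette, dims in targets.items():
        for dim, target in dims.items():
            errors.setdefault(dim, []).append(trainee[vignette][dim] - target)
    return {
        dim: {
            # Positive values mean the trainee rates higher than the experts.
            "mean_signed_error": mean(errs),
            # Typical size of the miss, regardless of direction.
            "mean_abs_error": mean([abs(e) for e in errs]),
        }
        for dim, errs in errors.items()
    }


# Illustrative expert targets and one trainee's practice ratings.
targets = {
    "v1": {"communication": 4, "teamwork": 2},
    "v2": {"communication": 2, "teamwork": 5},
}
trainee = {
    "v1": {"communication": 5, "teamwork": 2},
    "v2": {"communication": 3, "teamwork": 4},
}
feedback = accuracy_feedback(trainee, targets)
print(feedback)
```

Here the trainee rates communication one point too high on both vignettes, which is the kind of dimension-specific pattern a trainer could discuss when recalibrating the trainee's frame of reference.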
The broader context of performance appraisal has shifted in recent years. Several prominent organizations (Deloitte, GE, Adobe) have abandoned traditional annual reviews in favor of more frequent check-ins and continuous feedback. This shift reflects growing recognition that the annual review format — a single conversation covering an entire year — is poorly suited to the pace of modern work. However, the underlying measurement challenges do not disappear with more frequent feedback; they simply occur more often. The fundamental tension between evaluative and developmental purposes, and the difficulty of accurately observing and remembering behavior, persist regardless of the format.