School systems regularly use student assessments for accountability purposes. But, as highlighted by our conceptual model, different configurations of assessment usage generate performance-conducive incentives of different strengths for different stakeholders in different school environments. We build a dataset of more than 2 million students in 59 countries, observed over six waves of the international PISA student achievement test from 2000 to 2015. Our empirical model exploits the country panel dimension to investigate reforms in assessment systems over time, where identification comes from controlling for country and year fixed effects along with a rich set of student, school, and country measures. We find that the expansion of standardized external comparisons, both school-based and student-based, is associated with improvements in student achievement. The effect of school-based comparison is stronger in countries with initially low performance. Similarly, standardized monitoring without external comparison has a positive effect in initially poorly performing countries. By contrast, the introduction of solely internal testing and internal teacher monitoring, including inspectorates, does not affect student achievement. Our findings point out the pitfalls of overly broad generalizations from specific country testing systems.
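
As a rough illustration of the kind of specification described above, a country panel model with country and year fixed effects might be written as follows. The notation is assumed here for exposition and is not taken from the paper itself.

```latex
% Illustrative two-way fixed effects specification (all symbols are assumptions)
A_{ict} = \beta \, T_{ct} + X_{ict}'\gamma + \mu_c + \mu_t + \varepsilon_{ict}
```

Here $A_{ict}$ would denote the achievement of student $i$ in country $c$ in wave $t$, $T_{ct}$ a country-level measure of assessment usage, $X_{ict}$ the student, school, and country controls, $\mu_c$ and $\mu_t$ the country and year fixed effects, and $\varepsilon_{ict}$ an error term. Under this sketch, $\beta$ is identified only from within-country changes in assessment usage over time, net of shocks common to all countries in a given wave.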