Focusing on issues of intercoder reliability, this paper describes problems encountered in designing coding systems that classify language using discourse-relevant categories. First, given the absence of consensus among language scholars, we examine options for selecting and structuring code categories, particularly those options that affect intercoder reliability. We observe that computer-assisted coding can maximize the options available to researchers for selecting and structuring code categories while minimizing the problems of achieving and evaluating intercoder reliability. Second, focusing on reliability evaluation, we identify three alternative measures of intercoder reliability and present data from reliability tests using the strongest of the three. These data show how structural properties of categories, such as their frequency or their status as a second pair part, can influence intercoder reliability. Despite the need to exercise caution in interpreting and reporting the results of reliability testing, researchers should view the development of coding systems as they view the development of theories: as a dynamic process in which reliability tests may provide further opportunities for theoretical validation.
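The abstract does not name the three reliability measures it compares, so the following is only an illustrative sketch, assuming one of them is a chance-corrected agreement statistic such as Cohen's kappa, which discounts the agreement two coders would reach by guessing according to their own category frequencies. The category labels and data below are hypothetical.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders' category labels."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    # Observed agreement: proportion of units both coders labeled identically.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected chance agreement, from each coder's marginal category frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two coders classifying ten utterances.
# A rare category ("repair") hints at how category frequency can
# depress chance-corrected agreement even when raw agreement is high.
a = ["question", "answer", "answer", "question", "repair",
     "answer", "question", "answer", "answer", "question"]
b = ["question", "answer", "question", "question", "answer",
     "answer", "question", "answer", "answer", "question"]
print(round(cohens_kappa(a, b), 3))  # raw agreement is 0.8; kappa is lower
```

Here the coders agree on 8 of 10 units (0.8), but kappa falls to roughly 0.64 once expected chance agreement is subtracted, illustrating why chance-corrected measures are generally considered stronger than simple percent agreement.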