Sylvain Lamprier, Tassadit Amghar, Bernard Levrat, Frederic Saubion
This paper describes SegGen, a new algorithm for linear text segmentation on general corpora. It aims to segment texts into thematically homogeneous parts. Several existing methods for this task create boundaries sequentially. Here, we propose to consider all boundaries simultaneously by means of a genetic algorithm. SegGen uses two criteria: maximization of the internal cohesion of the formed segments and minimization of the similarity between adjacent segments. Initial experimental results are promising, and SegGen appears to be very competitive with existing methods.
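The two criteria can be illustrated with a minimal sketch. It is not the SegGen implementation: the abstract does not specify the similarity measure, the encoding of a segmentation, or the genetic operators, so everything here is an assumption made for illustration. Sentences are represented as sets of words, similarity is Jaccard overlap, a candidate segmentation is a binary vector of boundaries, and the function names (`jaccard`, `split_segments`, `internal_cohesion`, `adjacent_similarity`) are hypothetical. The genetic search itself is omitted; the sketch only scores candidate segmentations.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two word sets (assumed measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def split_segments(sentences, boundaries):
    """Group sentences into segments.

    boundaries[i] == 1 means a cut after sentence i (assumed encoding);
    the vector has len(sentences) - 1 entries.
    """
    segments, current = [], []
    for i, sentence in enumerate(sentences):
        current.append(sentence)
        if i < len(boundaries) and boundaries[i] == 1:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def internal_cohesion(segments):
    """Mean pairwise sentence similarity inside each segment (to maximize)."""
    scores = []
    for segment in segments:
        pairs = list(combinations(segment, 2))
        if pairs:
            scores.append(sum(jaccard(a, b) for a, b in pairs) / len(pairs))
        else:
            # Assumption: a one-sentence segment contributes no cohesion.
            scores.append(0.0)
    return sum(scores) / len(scores)


def adjacent_similarity(segments):
    """Mean similarity between neighbouring segments (to minimize)."""
    if len(segments) < 2:
        return 0.0
    bags = [set().union(*segment) for segment in segments]
    sims = [jaccard(bags[i], bags[i + 1]) for i in range(len(bags) - 1)]
    return sum(sims) / len(sims)


# Toy text: two sentences about pets, then two about markets.
sentences = [
    {"cat", "dog"},
    {"cat", "pet"},
    {"stock", "market"},
    {"market", "price"},
]
for name, boundaries in [("thematic cut", [0, 1, 0]), ("poor cuts", [1, 0, 1])]:
    segments = split_segments(sentences, boundaries)
    print(name, internal_cohesion(segments), adjacent_similarity(segments))
```

On this toy input, the segmentation that cuts at the topic change scores higher on cohesion and lower on adjacent similarity than the alternative. A genetic algorithm would evolve a population of such boundary vectors and evaluate each one as a whole, rather than placing one boundary at a time.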
Subjects: 13. Natural Language Processing; 1.9 Genetic Algorithms
Submitted: Oct 14, 2006