Hiroshi Sakamoto, Yoshitsugu Murakami, Hiroki Arimura, and Setsuo Arikawa, Kyushu University, Japan
A new wrapper model for extracting text data from HTML documents is introduced. In this model, an HTML file is regarded as an ordered labeled tree. The learning algorithm takes as input a sequence of pairs, each consisting of an HTML tree and a set of its nodes. The nodes indicate the labels to extract from the HTML tree. The goal of the learning algorithm is to output a wrapper that extracts exactly those labels from the HTML trees.
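
To make the input format concrete, the following is a minimal sketch of how one training example (an HTML tree paired with a set of marked nodes) might be represented. It is an illustration only, not the authors' implementation: the names `Node`, `TreeBuilder`, `parse`, and `node_at` are hypothetical, nodes are addressed by child-index paths as an assumed convention, and the parser assumes well-formed HTML without void elements such as `<br>`.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Node:
    """One node of an ordered labeled tree (hypothetical representation)."""
    label: str                                  # tag name, or "#text" for a text leaf
    text: str = ""                              # text content, used only by "#text" leaves
    children: list["Node"] = field(default_factory=list)  # order is significant


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes well-formed HTML with matching end tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the current element only if the end tag matches it.
        if len(self.stack) > 1 and self.stack[-1].label == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Keep non-blank text as a "#text" leaf under the current element.
        if data.strip():
            self.stack[-1].children.append(Node("#text", data.strip()))


def parse(html: str) -> Node:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def node_at(root: Node, path: tuple) -> Node:
    """Follow a path of child indices from the root to a node."""
    node = root
    for index in path:
        node = node.children[index]
    return node


# One training example: a tree plus the set of nodes whose text should be extracted.
tree = parse("<ul><li><b>Title</b>Alpha</li><li><b>Title</b>Beta</li></ul>")
marked = {(0, 0, 1), (0, 1, 1)}   # paths to the text leaves after each <b> element
example = (tree, marked)

extracted = [node_at(tree, path).text for path in sorted(marked)]
```

In this sketch a learner would receive a sequence of such `(tree, marked)` pairs, and a correct wrapper applied to `tree` would return exactly the marked text leaves.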