H. Leong, S. Kapur, and O. de Vel
The rapidly growing volume of electronic information available in distributed heterogeneous environments, such as the World Wide Web, has made it increasingly difficult and time-consuming to search for and locate relevant documents (in textual, visual or audio format). To relieve users of this burden, intelligent tools are required that automate search and retrieval by generating profiles of user interests with minimal user interaction. This paper describes an intelligent knowledge filtering system (SAMURAI) and, in particular, its text summarisation and clustering modules. Modules for extracting salient concepts from documents were built and evaluated on a variety of documents from different knowledge domains. A natural language processing approach based on part-of-speech tagging was compared with an alternative approach based on the well-known and commonly used TFIDF information retrieval algorithm. Results show that the tagger-based approach extracts more informative keywords and phrases from each document set than the TFIDF approach. Furthermore, the tagger-based system requires significantly less computation time than TFIDF, making it scalable to large document sets.
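For readers unfamiliar with the TFIDF baseline mentioned above, the following is a minimal sketch of TFIDF keyword ranking. It is not the SAMURAI implementation: the function name `tfidf_keywords`, the `top_k` parameter, the toy documents, and the particular weighting (relative term frequency multiplied by the logarithm of inverse document frequency) are all assumptions chosen for illustration, and other TFIDF variants exist.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring terms for each tokenised document.

    Hypothetical helper for illustration only; the weighting used here is
    one common TFIDF variant, not necessarily the one used in the paper.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        tf = Counter(doc)
        # Score = relative term frequency * log inverse document frequency.
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:top_k])
    return results


# Toy corpus (invented for this example), already split into tokens.
docs = [
    "the tagger extracts noun phrases from the text".split(),
    "the filter ranks the documents by user profile".split(),
    "the clustering module groups the documents".split(),
]
keywords = tfidf_keywords(docs, top_k=2)
```

In this sketch a term that occurs in every document, such as "the", receives a score of zero and so never ranks above a term that is specific to one document. Note also that the document frequencies must be computed over the whole collection before any document can be scored.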