Text mining without document context

SanJuan, Eric and Ibekwe-SanJuan, Fidelia. Text mining without document context. Information Processing & Management, 2006, vol. 42, n. 6, pp. 1532-1552. [Journal article (Paginated)]

English abstract

We consider a challenging clustering task: the clustering of multi-word terms without document co-occurrence information in order to form coherent groups of topics. For this task, we developed a methodology taking as input multi-word terms and lexico-syntactic relations between them. Our clustering algorithm, named CPCL, is implemented in the TermWatch system. We compared CPCL to other existing clustering algorithms, namely hierarchical and partitioning (k-means, k-medoids). This out-of-context clustering task led us to adapt multi-word term representation for statistical methods and also to refine an existing cluster evaluation metric, the editing distance, in order to evaluate the methods. Evaluation was carried out on a list of multi-word terms from the genomic field that comes with a hand-built taxonomy. Results showed that while k-means and k-medoids obtained good scores on the editing distance, they were very sensitive to term length. CPCL, on the other hand, obtained a better cluster homogeneity score and was less sensitive to term length. Also, CPCL showed good adaptability for handling very large and sparse matrices.

Item type:	Journal article (Paginated)
Keywords:	Multi-word term clustering, lexico-syntactic relations, text mining, informetrics, cluster evaluation
Subjects:	I. Information treatment for information services > IB. Content analysis (A and I, class.); I. Information treatment for information services > ID. Knowledge representation.
Depositing user:	Fidelia Ibekwe-SanJuan
Date deposited:	26 Feb 2008
Last modified:	02 Oct 2014 12:10
URI:	http://hdl.handle.net/10760/11148
