Text mining without document context

SanJuan, Eric and Ibekwe-SanJuan, Fidelia. Text mining without document context. Information Processing & Management, 2006, vol. 42, n. 6, pp. 1532-1552. [Journal article (Paginated)]

English abstract

We consider a challenging clustering task: clustering multi-word terms without document co-occurrence information in order to form coherent groups of topics. For this task, we developed a methodology that takes as input multi-word terms and the lexico-syntactic relations between them. Our clustering algorithm, named CPCL, is implemented in the TermWatch system. We compared CPCL to other existing clustering algorithms, namely hierarchical and partitioning (k-means, k-medoids) algorithms. This out-of-context clustering task led us to adapt the multi-word term representation for statistical methods and to refine an existing cluster evaluation metric, the editing distance, in order to evaluate the methods. Evaluation was carried out on a list of multi-word terms from the genomic field that comes with a hand-built taxonomy. Results showed that while k-means and k-medoids obtained good scores on the editing distance, they were very sensitive to term length. CPCL, on the other hand, obtained a better cluster homogeneity score and was less sensitive to term length. CPCL also showed good adaptability for handling very large and sparse matrices.

Item type: Journal article (Paginated)
Keywords: Multi-word term clustering, lexico-syntactic relations, text mining, informetrics, cluster evaluation
Subjects: I. Information treatment for information services > IB. Content analysis (A and I, class.)
I. Information treatment for information services > ID. Knowledge representation.
Depositing user: Fidelia Ibekwe-SanJuan
Date deposited: 26 Feb 2008
Last modified: 02 Oct 2014 12:10
URI: http://hdl.handle.net/10760/11148
