An Overview of Approaches to Quantify Open Data Catalog Similarity

Martinez-Gil, Jorge (2023). An Overview of Approaches to Quantify Open Data Catalog Similarity. [Preprint]

English abstract

As open data initiatives gain importance, effective methods for assessing the similarity between open data catalogs are increasingly needed. Measuring catalog similarity supports processes such as catalog curation, data discovery, and the interlinking of open data repositories. This research provides an overview of existing approaches to quantifying the similarity between open data catalogs. We explore strategies ranging from traditional methods based on comparing triples to advanced semantic-based methods and hashing methods for specific domain languages. We also identify key challenges and future research directions in open data catalog similarity measurement.
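
As a rough illustration of what "comparing triples" can mean, the sketch below scores two catalogs by the Jaccard overlap of their triple sets. This is an assumption made for illustration only: the abstract does not say which triple-based measure the paper uses, and the function name `jaccard_similarity` and the sample triples are hypothetical.

```python
from typing import Set, Tuple

# A triple is (subject, predicate, object); plain strings stand in for RDF terms.
Triple = Tuple[str, str, str]


def jaccard_similarity(a: Set[Triple], b: Set[Triple]) -> float:
    """Return |a ∩ b| / |a ∪ b|, treating two empty catalogs as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical catalogs, invented for this example.
catalog_a = {
    ("ds1", "dct:title", "Air quality"),
    ("ds1", "dcat:keyword", "environment"),
    ("ds2", "dct:title", "Bus stops"),
}
catalog_b = {
    ("ds1", "dct:title", "Air quality"),
    ("ds1", "dcat:keyword", "environment"),
    ("ds3", "dct:title", "Bike lanes"),
}

# Two shared triples out of four distinct triples overall.
score = jaccard_similarity(catalog_a, catalog_b)
print(score)
```

Exact set overlap only counts triples that match literally, which is presumably one reason the abstract also lists semantic-based and hashing methods.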

Item type: Preprint
Keywords: Open Data, Data Catalogs, Semantic Similarity Measurement
Subjects: I. Information treatment for information services > IA. Cataloging, bibliographic control.
I. Information treatment for information services > IM. Open data
Depositing user: Dr Jorge Martinez-Gil
Date deposited: 02 Oct 2023 14:25
Last modified: 02 Oct 2023 14:25
URI: http://hdl.handle.net/10760/44881
