Improving Record Linkage through Metaheuristic Optimization
DOI:
https://doi.org/10.22399/ijcesen.3981
Keywords:
Entity Resolution, Record Linkage, Blocking Key Selection, Data Quality, Whale Optimization Algorithm, Grey Wolf Optimizer
Abstract
The exponential growth of digital data has amplified the importance of record linkage (RL), a fundamental task in data quality management that identifies and merges records referring to the same real-world entity. A critical step in RL is the blocking process, which reduces computational cost by partitioning records into candidate sets. The effectiveness of blocking depends on the choice of blocking keys (BKs), and poor selection can either increase complexity or degrade linkage quality. Since manual BK selection is costly and supervised approaches require labeled data that are often unavailable, recent research has focused on unsupervised optimization-based methods.
In this study, we investigate two bio-inspired metaheuristic algorithms, the Whale Optimization Algorithm (WOA) and the Grey Wolf Optimizer (GWO), for automatic blocking key selection. Both algorithms reformulate BK selection as a feature selection problem, where candidate subsets of keys are optimized using a wrapper-based evaluation function that balances Pair Completeness (PC), Reduction Ratio (RR), and F-measure. WOA exploits the bubble-net hunting strategy of humpback whales, while GWO models the social hierarchy and cooperative hunting behavior of grey wolves, enabling both to effectively balance exploration and exploitation in high-dimensional search spaces.
Experimental evaluations on multiple real-world datasets, including standard RL benchmarks and an Arabic dataset, demonstrate that WOA and GWO outperform traditional blocking strategies and achieve competitive performance compared to recent metaheuristic-based methods. Both approaches yield stable convergence, improved recall, and high reduction ratios, confirming their effectiveness and robustness in enhancing large-scale record linkage.
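To make the wrapper-style evaluation concrete, the following is a minimal sketch of how a candidate subset of blocking keys might be scored. It uses the standard definitions of Reduction Ratio (one minus the share of all record pairs kept as candidates) and Pair Completeness (the share of true matches that survive blocking). Everything else is an assumption for illustration and is not taken from the paper: the toy records, the field names, the use of known matches (the study itself targets the unsupervised setting), the harmonic mean as the combined score, and the exhaustive search, which stands in for the WOA or GWO search over key subsets.

```python
from itertools import combinations


def candidate_pairs(records, key_subset):
    """Return the set of record-id pairs that share a block under key_subset."""
    blocks = {}
    for rec_id, rec in records.items():
        # Records with identical values on every chosen key land in one block.
        block_key = tuple(rec[k] for k in key_subset)
        blocks.setdefault(block_key, []).append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def blocking_quality(records, key_subset, true_matches):
    """Compute (PC, RR, score) for one candidate subset of blocking keys."""
    n = len(records)
    total_pairs = n * (n - 1) // 2
    pairs = candidate_pairs(records, key_subset)
    rr = 1.0 - len(pairs) / total_pairs                 # Reduction Ratio
    pc = len(pairs & true_matches) / len(true_matches)  # Pair Completeness
    # Assumed combined objective: harmonic mean of PC and RR.
    score = 2 * pc * rr / (pc + rr) if pc + rr > 0 else 0.0
    return pc, rr, score


# Hypothetical toy data: four records, two true duplicate pairs.
records = {
    1: {"surname": "smith", "zip": "2600", "year": "1980"},
    2: {"surname": "smith", "zip": "2600", "year": "1981"},
    3: {"surname": "jones", "zip": "2600", "year": "1975"},
    4: {"surname": "jones", "zip": "2611", "year": "1975"},
}
true_matches = {(1, 2), (3, 4)}
keys = ("surname", "zip", "year")

# Exhaustive search over non-empty key subsets; a metaheuristic would
# explore this space instead when the number of keys is large.
subsets = [s for r in range(1, len(keys) + 1) for s in combinations(keys, r)]
best_subset = max(
    subsets, key=lambda s: blocking_quality(records, s, true_matches)[2]
)
print(best_subset, blocking_quality(records, best_subset, true_matches))
```

With n candidate keys there are 2^n - 1 non-empty subsets, so exhaustive enumeration stops being practical quickly, which is the motivation for searching the subset space with a metaheuristic.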
License
Copyright (c) 2025 International Journal of Computational and Experimental Science and Engineering

This work is licensed under a Creative Commons Attribution 4.0 International License.