Big Data Deduplication in Data Lake

Abstract
Data lakes are a next-generation technology for processing and storing big data. As with
any new technology, they bring new challenges, one of which is the occurrence of duplicate
data in storage. This paper addresses that challenge during the data ingestion phase, where
deduplication is currently overlooked or insufficiently addressed. The first part discusses
the design of a suitable data lake architecture and a deduplication workflow for processing
structured and unstructured data. The proposed solution is then evaluated through
experiments that examine a flexible deduplication window, the scalability of the solution,
the choice of a suitable hash function, and the advantages of an in-memory pointer
repository.
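
As a rough illustration of the idea of hash-based deduplication at ingestion time with an in-memory pointer repository, a minimal sketch might look like the following. It is not the authors' implementation: the class name `PointerRepository`, the method `ingest`, the storage locations, and the choice of SHA-256 are all assumptions made for this example.

```python
import hashlib
from typing import Dict, Optional


class PointerRepository:
    """Hypothetical in-memory map from a content digest to the location of its first copy."""

    def __init__(self) -> None:
        self._pointers: Dict[str, str] = {}

    def ingest(self, payload: bytes, location: str) -> Optional[str]:
        """Return the location of an earlier identical payload, or None if it is new.

        A new payload is registered under `location`; a duplicate is not stored again.
        """
        # SHA-256 is an assumption here; the abstract only says a suitable hash is evaluated.
        digest = hashlib.sha256(payload).hexdigest()
        existing = self._pointers.get(digest)
        if existing is not None:
            return existing
        self._pointers[digest] = location
        return None


repo = PointerRepository()
first = repo.ingest(b"sensor-reading-42", "lake/raw/0001")
second = repo.ingest(b"sensor-reading-42", "lake/raw/0002")
```

In this sketch the second call returns the location recorded by the first, so the ingestion step could store a pointer instead of a second copy. Keeping the map in memory trades RAM for lookup speed, which is presumably the kind of trade-off the experiments on the pointer repository examine.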
- Title
- Big Data Deduplication in Data Lake
- Author
- Hlavačka, Jakub
- Bobák, Martin
- Hluchý, Ladislav
- Date issued
- 2024
- Access rights
- Open access
- ISSN
- 1785-8860
- Language
- en
- Pages
- 20 p.
- Subject
- data lake, deduplication, big data
- Version
- Publisher's version
- Identifiers
- DOI: 10.12700/APH.21.11.2024.11.17
- Journal
- Acta Polytechnica Hungarica
- Journal year
- 2024
- Journal volume
- Vol. 21
- Journal issue
- No. 11
- Type
- Scientific article
- Subject area
- Engineering sciences - informatics
- Publisher
- Óbudai Egyetem (Óbuda University)