Big Data Deduplication in Data Lake

Hlavačka, Jakub; Bobák, Martin; Hluchý, Ladislav

View/Open

Hlavacka_Bobak_Hluchy_151.pdf (690.3KB)

Link for citing this document:

http://hdl.handle.net/20.500.14044/32350

Collection

2.01. 2024 Volume 21, Issue No. 11. [17]

Abstract

Data lakes are the next generation of technology for processing and storing big data. As with any new technology, new challenges and problems inevitably arise. One of these problems is the occurrence of duplicate data in storage. Our paper aims to address this challenge during the data ingestion phase, where it is currently overlooked or insufficiently addressed. The first part discusses the design of a suitable data lake architecture and a deduplication workflow for processing structured and unstructured data. The proposed solution is evaluated through experiments that examine the flexible deduplication window, the scalability of the solution, the choice of a suitable hash function, and the advantages of an in-memory pointer repository.
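
As a rough illustration of the ideas named in the abstract (hash-based duplicate detection at ingestion, an in-memory pointer repository, and a bounded deduplication window), the following minimal Python sketch may help. It is not the authors' implementation: the class `IngestDeduplicator`, its `window_size` parameter, the choice of SHA-256, and the dictionary standing in for the data lake's storage are all assumptions made for this example.

```python
import hashlib
from collections import OrderedDict


class IngestDeduplicator:
    """Hypothetical ingest-time deduplicator (illustrative only)."""

    def __init__(self, window_size=1000):
        # Assumed window: how many recent fingerprints are remembered.
        self.window_size = window_size
        # In-memory pointer repository: fingerprint -> key of the stored copy.
        self.pointers = OrderedDict()
        # Stand-in for the data lake's object store (assumption).
        self.storage = {}

    def ingest(self, key, payload: bytes):
        """Store payload unless an identical one is inside the window.

        Returns (key_of_stored_copy, was_duplicate).
        """
        fingerprint = hashlib.sha256(payload).hexdigest()
        if fingerprint in self.pointers:
            # Duplicate: refresh its position and point to the existing copy.
            self.pointers.move_to_end(fingerprint)
            return self.pointers[fingerprint], True
        # New content: store it and remember where it lives.
        self.storage[key] = payload
        self.pointers[fingerprint] = key
        if len(self.pointers) > self.window_size:
            # Window is full: forget the least recently seen fingerprint.
            self.pointers.popitem(last=False)
        return key, False


dedup = IngestDeduplicator(window_size=2)
print(dedup.ingest("a", b"same bytes"))   # ('a', False)
print(dedup.ingest("b", b"same bytes"))   # ('a', True)
```

A larger window catches more duplicates at the cost of more memory for the pointer repository; a duplicate that arrives after its fingerprint has been evicted is stored again.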

Title and subtitle: Big Data Deduplication in Data Lake
Author: Hlavačka, Jakub; Bobák, Martin; Hluchý, Ladislav
Date of publication: 2024
Access level: Open access
ISSN, e-ISSN: 1785-8860
Language: en
Extent: 20 p.
Subject: data lake, deduplication, big data
Version: Publisher's version
Other identifiers: DOI: 10.12700/APH.21.11.2024.11.17
Title of the document containing the article/book chapter: Acta Polytechnica Hungarica
Year of the source journal: 2024
Volume of the source journal: Vol. 21
Issue of the source journal: No. 11
Genre: Scientific article
Field of science: Engineering sciences - computer science
University: Óbudai Egyetem (Óbuda University)