Hlavačka, Jakub
Bobák, Martin
Hluchý, Ladislav
2025-08-18T13:34:27Z
2025-08-18T13:34:27Z
2024
1785-8860
http://hdl.handle.net/20.500.14044/32350
Data lakes are the next generation of technology for processing and storing big data. As with any new technology, they bring new challenges and problems. One of these is the occurrence of duplicate data in storage. Our paper aims to address this challenge during the data ingestion phase, which is currently overlooked or insufficiently addressed. The first part discusses the design of a suitable architecture for the data lake and a deduplication workflow for processing structured and unstructured data. The proposed solution is evaluated through experiments that examine a flexible deduplication window, the scalability of the solution, the choice of a suitable hash function, and the advantages of an in-memory pointer repository.
dc.format: PDF
en
Big Data Deduplication in Data Lake
Open access
Óbudai Egyetem
Budapest
Óbudai Egyetem
Engineering sciences - informatics
data lake
deduplication
big data
Scientific article
Acta Polytechnica Hungarica
local.tempfieldCollections: Journal articles
10.12700/APH.21.11.2024.11.17
Publisher's version
20 p.
No. 11
Vol. 21
2024
Óbudai Egyetem

