Data lakes are the next generation of technology for processing and storing big data. As
with any new technology, new challenges and problems inevitably arise. One of these
problems is the occurrence of duplicate data in storage. Our paper addresses this
challenge during the data ingestion phase, where it is currently overlooked or
insufficiently addressed. The first part discusses the design of a suitable data lake
architecture and a deduplication workflow for processing structured and unstructured data. The proposed
solution is evaluated through experiments that examine a flexible deduplication window,
the scalability of the solution, the choice of a suitable hash function, and the advantages of an
in-memory pointer repository.
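
To make the components named above concrete, the following is a minimal sketch of hash-based deduplication at ingestion with an in-memory pointer repository and a bounded deduplication window. It assumes a Python ingestion path; the PointerRepository class, the window_size parameter, the ingest function, and the choice of SHA-256 are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the paper's implementation): hash-based deduplication at
# ingestion using an in-memory pointer repository and a bounded deduplication window.

import hashlib
from collections import OrderedDict


class PointerRepository:
    """In-memory map from content hash to a storage pointer, bounded by a
    deduplication window holding the most recently seen items."""

    def __init__(self, window_size: int = 10_000):
        self.window_size = window_size
        self._pointers: "OrderedDict[str, str]" = OrderedDict()

    def lookup(self, digest: str) -> str | None:
        pointer = self._pointers.get(digest)
        if pointer is not None:
            self._pointers.move_to_end(digest)   # refresh position in the window
        return pointer

    def register(self, digest: str, pointer: str) -> None:
        self._pointers[digest] = pointer
        self._pointers.move_to_end(digest)
        if len(self._pointers) > self.window_size:
            self._pointers.popitem(last=False)   # evict the oldest entry


def ingest(payload: bytes, repo: PointerRepository, store: dict) -> str:
    """Store `payload` only if its hash has not been seen inside the window;
    otherwise return the pointer to the already-stored object."""
    digest = hashlib.sha256(payload).hexdigest()   # hash function is configurable
    existing = repo.lookup(digest)
    if existing is not None:
        return existing                            # duplicate: reuse stored object
    pointer = f"object/{digest[:16]}"              # placeholder storage key
    store[pointer] = payload
    repo.register(digest, pointer)
    return pointer


if __name__ == "__main__":
    repo, store = PointerRepository(window_size=3), {}
    for blob in [b"record-A", b"record-B", b"record-A"]:
        print(ingest(blob, repo, store))
    print("objects stored:", len(store))           # 2, not 3
```

The bounded OrderedDict stands in for the flexible deduplication window: shrinking or enlarging window_size trades memory for how far back duplicates can be detected, which is one of the dimensions the experiments above evaluate.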