Now let’s say you have already extracted a bunch of records by making API requests with the above-mentioned params. It’s time to decide how you want to write them to the destination table.
👉 Answer: Merge/Dedup mode (recommended)
This question concerns the choice of write disposition, or sync mode. The short answer is that, given you want to load your data incrementally, you’ll likely choose to write your extracted data in either append mode or merge mode (also known as deduplication mode).
However, let’s step back to look at our options more closely and decide which method is best suited for incremental loading.
Here are the popular write dispositions.
- 🟪 overwrite/replace: drop all existing records in the destination tables and then insert the extracted records.
- 🟪 append: simply append the extracted records to the destination tables.
- 🟪 merge / dedup: insert new(*) records and update(**) existing records.
(*) How do we know which records are new? Usually, we’ll use a primary key to determine that. If you use dlt, its merging strategy can be more sophisticated than that, including the distinction between merge_key and primary_key (one is used for merging and one is used for deduplication before merging) or dedup_sort (which records with the same key are to be deleted in the dedup process). I’ll leave that part for another tutorial.
(**) This is a simplified explanation. If you want to find out more about how dlt handles this merging strategy, read more here.
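To make the (*) note concrete, here is a minimal plain-Python sketch of how a primary key can separate new records from existing ones. It is not how dlt implements merging; the function name `split_new_and_existing` and the sample rows are hypothetical and only serve as an illustration.

```python
def split_new_and_existing(destination, extracted, primary_key="id"):
    """Split extracted records into (new, existing) using the primary key."""
    known_keys = {row[primary_key] for row in destination}
    new = [row for row in extracted if row[primary_key] not in known_keys]
    existing = [row for row in extracted if row[primary_key] in known_keys]
    return new, existing


# Hypothetical rows, for illustration only.
destination = [{"id": 1, "sales": 100}, {"id": 2, "sales": 200}]
extracted = [{"id": 1, "sales": 150}, {"id": 3, "sales": 300}]

to_insert, to_update = split_new_and_existing(destination, extracted)
print(to_insert)  # [{'id': 3, 'sales': 300}]
print(to_update)  # [{'id': 1, 'sales': 150}]
```

Records whose key is unknown to the destination are candidates for insertion, and the rest are candidates for an update.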
👁️👁️ Here is an example to help us understand the results of different write dispositions.
↪️ On 2024.06.19: We make the first sync.
🅰️ Data in source application
🅱️ Data loaded to our destination database
No matter which sync strategy you choose, the table at the destination is essentially a copy of the source table.
Saved state of updated_at = 2024-06-03, which is the latest updated_at among the 2 records we synced.
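A minimal sketch of how that state could be derived, assuming the cursor is simply the maximum updated_at among the loaded records. The helper name `latest_cursor`, the sales figures, and the earlier date are hypothetical; only the latest value, 2024-06-03, comes from the example.

```python
from datetime import date

# Hypothetical records from the first sync; only the latest updated_at
# (2024-06-03) is taken from the example in the text.
first_sync = [
    {"id": 1, "sales": 100, "updated_at": date(2024, 6, 1)},
    {"id": 2, "sales": 200, "updated_at": date(2024, 6, 3)},
]


def latest_cursor(records, cursor_field="updated_at"):
    """Return the highest cursor value among the synced records."""
    return max(row[cursor_field] for row in records)


saved_state = latest_cursor(first_sync)
print(saved_state)  # 2024-06-03
```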
↪️ On 2024.06.2: We make the second sync.
🅰️ Data in source application
✍️ Changes in the source table:
- Record id=1 was updated (sales figure).
- Record id=2 was dropped.
- Record id=3 was inserted.
At this sync, we ONLY extract records with updated_at > 2024-06-03 (the state saved from the last sync). Therefore, we’ll extract only records id=1 and id=3. Since record id=2 was removed from the source data, it never shows up in the extract, so there is no way for us to recognize this change.
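The filter described above can be sketched in a few lines of plain Python. The helper name `extract_incremental`, the sales figures, and the updated_at values of the changed records are assumptions made for illustration; in a real pipeline the filter would usually be pushed into the API request itself.

```python
from datetime import date

saved_state = date(2024, 6, 3)  # cursor saved after the first sync

# Hypothetical source table at the second sync: id=1 updated, id=2 deleted,
# id=3 inserted. The dates and sales figures are made up.
source = [
    {"id": 1, "sales": 150, "updated_at": date(2024, 6, 20)},
    {"id": 3, "sales": 300, "updated_at": date(2024, 6, 20)},
]


def extract_incremental(records, last_value, cursor_field="updated_at"):
    """Keep only records whose cursor is strictly greater than the saved state."""
    return [row for row in records if row[cursor_field] > last_value]


extracted = extract_incremental(source, saved_state)
print([row["id"] for row in extracted])  # [1, 3]
```

Note that nothing in the extract mentions id=2: a cursor-based filter only sees rows that still exist in the source.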
With the second sync, you will now see the difference among the write strategies.
🅱️ Data loaded to our destination database
❗ Scenario 1: Overwrite
The destination table will be overwritten by the 2 records extracted this time.
❗ Scenario 2: Append
The 2 extracted records will be appended to the destination table; the existing records are not affected.
❗ Scenario 3: Merge or dedup
The extracted record id=1 will replace the existing record with the same id at the destination, and the extracted record id=3 will be inserted as a new one. This process is called merging or deduplicating. Record id=2 in the destination table remains intact.
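The three scenarios can be reproduced with a small plain-Python sketch. The functions below are simplified stand-ins for what a loader does, not dlt's implementation, and the sales figures are hypothetical.

```python
# Hypothetical rows: the destination after the first sync and the second extract.
destination = [{"id": 1, "sales": 100}, {"id": 2, "sales": 200}]
extracted = [{"id": 1, "sales": 150}, {"id": 3, "sales": 300}]


def overwrite(destination, extracted):
    """Drop everything in the destination and keep only the new extract."""
    return list(extracted)


def append(destination, extracted):
    """Keep existing rows and add the extract, duplicates included."""
    return destination + extracted


def merge(destination, extracted, primary_key="id"):
    """Upsert by primary key: update matching rows, insert the rest."""
    by_key = {row[primary_key]: row for row in destination}
    by_key.update({row[primary_key]: row for row in extracted})
    return list(by_key.values())


print([r["id"] for r in overwrite(destination, extracted)])  # [1, 3]
print([r["id"] for r in append(destination, extracted)])     # [1, 2, 1, 3]
print(merge(destination, extracted))
# [{'id': 1, 'sales': 150}, {'id': 2, 'sales': 200}, {'id': 3, 'sales': 300}]
```

Overwrite loses id=2 along with the history, append leaves two versions of id=1, and merge keeps one row per id while leaving the deleted id=2 in place.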
🟢 Takeaways: The merge (dedup) strategy can be effective in an incremental data loading pipeline, but if your table is very large, the dedup process might take a considerable amount of time.