Incremental vs Full Loading

Loading a table can be performed either incrementally or with a full reload each time. An incremental load is generally better, since it takes less time and transfers less data; however, not all tables can be loaded incrementally.

Incremental loading

If you are running an ETL process regularly, it is more efficient to load just the rows that have changed than to load the entire table each time. This is known as an incremental load. By default, if a table contains a column called updated_at, DataDuck ETL will use incremental loading based on that column. If no such column exists, it will load the entire table each time.

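If rows can be deleted from a table, you should not use incremental loading, since DataDuck ETL won't know which rows have been deleted. Soft deleting a row (for example, by setting a column to 'deleted') is fine to use with incremental loading, because the row still exists in the source and the change is extracted like any other update, provided the column used for incremental loading is updated as well.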
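The difference between the two kinds of delete can be illustrated with a small in-memory simulation. This is only a sketch of the general idea, not DataDuck ETL's code: the `incremental_load` method, the `status` column, and the hash-based "destination" are all hypothetical stand-ins.

```ruby
# Hypothetical in-memory stand-in for an incremental load: copy every source
# row whose updated_at is newer than the latest one already in the destination.
def incremental_load(source_rows, destination)
  latest = destination.values.map { |row| row[:updated_at] }.max
  changed = source_rows.select { |row| latest.nil? || row[:updated_at] > latest }
  changed.each { |row| destination[row[:id]] = row.dup }
  destination
end

destination = {}
source = [
  { id: 1, status: "active", updated_at: Time.utc(2016, 1, 1) },
  { id: 2, status: "active", updated_at: Time.utc(2016, 1, 2) },
]
incremental_load(source, destination)

# Soft delete: row 1 is updated in place, so the next load picks it up.
source[0] = { id: 1, status: "deleted", updated_at: Time.utc(2016, 1, 3) }
# Hard delete: row 2 simply disappears, leaving nothing to extract.
source.delete_at(1)
incremental_load(source, destination)

puts destination[1][:status] # the soft delete is reflected: "deleted"
puts destination.key?(2)     # the hard-deleted row is still present: true
```

After the second load, the soft-deleted row carries its new status, while the hard-deleted row remains in the destination because nothing in the source signals its removal.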

Under the hood, before extracting, DataDuck ETL will check the destination for the latest value of a column, then use that value as a lower bound on that column when running the extract query, so that only newer rows are extracted.
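A minimal sketch of this idea in plain Ruby might look like the following. It is not DataDuck ETL's actual implementation: the helper name `build_extract_query`, the `users` table, and the choice of `>=` for the comparison are assumptions made for illustration, and the quoting is deliberately naive.

```ruby
# Hypothetical helper, not part of DataDuck ETL: builds the extract query
# for an incremental load, given the latest value already in the destination.
def build_extract_query(table, columns, extract_by_column, latest_value)
  query = "SELECT #{columns.join(', ')} FROM #{table}"
  # With no previous value (for example, on the first run), extract everything.
  return query if latest_value.nil?

  # Naive quoting for illustration only: single quotes are doubled.
  escaped = latest_value.to_s.gsub("'", "''")
  # Assumption: >= re-reads rows that share the boundary value, which is only
  # safe if the load step replaces existing rows rather than duplicating them.
  "#{query} WHERE #{extract_by_column} >= '#{escaped}'"
end

puts build_extract_query("users", ["id", "updated_at", "name"], "updated_at", "2016-01-01 00:00:00")
```

In a real pipeline, `latest_value` would come from a query such as `SELECT MAX(updated_at)` against the destination table, and the value would be passed as a bound parameter rather than interpolated into the SQL string.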

If you would like to base an incremental load on a different column, such as id or created_at (common in cases where the rows are not expected to change, like an event stream), you can do so by defining an extract_by_column method on your table that returns the column name:

class MyTable < DataDuck::Table
  source :source1, ["id", "created_at", "name"]

  def extract_by_column
    'created_at'
  end

  output({
      :id => :integer,
      :created_at => :datetime,
      :name => :string,
  })
end

Full reloads

Fully reloading a table takes longer, so it is recommended only for tables that cannot be loaded incrementally.

If you would like to fully reload the table each time, you may give your table an extract_by_column method that returns nil. Alternatively, if you want to keep an extract_by_column but still reload the entire table each time, you may give it a should_fully_reload? method that returns true. This is useful, for example, when you are reloading an entire table but doing it in batches.

class MyTableFullyReloaded < DataDuck::Table
  source :source1, ["id", "created_at", "name"]

  def batch_size
    1_000_000 # if there is a lot of data, and you want to use less memory (for example), batching is a good idea
  end

  def extract_by_column
    'created_at'
  end

  def should_fully_reload?
    true
  end

  output({
      :id => :integer,
      :created_at => :datetime,
      :name => :string,
  })
end
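
The rules described on this page can be summarized in a few lines of plain Ruby. This is a hypothetical summary for the reader, not DataDuck ETL's source code: the `load_strategy` method, its keyword arguments, and the `:default` sentinel are invented for this sketch.

```ruby
# Hypothetical summary of the rules described on this page; this is not
# DataDuck ETL's source code.
def load_strategy(column_names, extract_by_column: :default, should_fully_reload: false)
  # By default, updated_at is used when the table has such a column.
  if extract_by_column == :default
    extract_by_column = column_names.include?("updated_at") ? "updated_at" : nil
  end

  # No column to extract by, or an explicit request, means a full reload.
  return :full_reload if extract_by_column.nil? || should_fully_reload

  :incremental
end

puts load_strategy(["id", "updated_at", "name"])
puts load_strategy(["id", "name"])
puts load_strategy(["id", "created_at"], extract_by_column: "created_at")
puts load_strategy(["id", "created_at"], extract_by_column: "created_at", should_fully_reload: true)
```

The last call mirrors MyTableFullyReloaded above: a column is still named for extraction, but the table is reloaded in full.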