Batch Enumerator Yielding Relations #91
Co-authored-by: Étienne Barrié <etienne.barrie@shopify.com>
cc @Shopify/rails @Shopify/job-patterns
@etiennebarrie shall we ship this?
83b4ad2 to 7c9a1be
      yield relation, cursor_value
    end
  else
    to_enum(:each)
You forgot the size here 😄
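For context on the comment above: when `to_enum` is called without a block, the resulting enumerator reports a `nil` size; passing a block lets `Enumerator#size` be computed lazily. The following is a minimal sketch in plain Ruby. `BatchSketch` and its `size` method are hypothetical stand-ins, not the class from this PR.

```ruby
# Hypothetical stand-in for the enumerator under review; not the PR's class.
class BatchSketch
  def initialize(items, batch_size:)
    @items = items
    @batch_size = batch_size
  end

  def each
    # Without a block, return an Enumerator. The block given to to_enum is
    # only evaluated when Enumerator#size is called.
    return to_enum(:each) { size } unless block_given?

    @items.each_slice(@batch_size) { |batch| yield batch }
  end

  # Number of batches, rounded up.
  def size
    (@items.size + @batch_size - 1) / @batch_size
  end
end

sketch = BatchSketch.new((1..7).to_a, batch_size: 3)
sketch.each.size # => 3
```

Dropping the `{ size }` block would make `sketch.each.size` return `nil`, which is the gap the comment points at.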
def each
  if block_given?
    while (relation = next_batch)
      break if @cursor.nil?
When is @cursor nil? Every place that sets it uses Array.wrap(), and if you pass nil to that it returns an empty array.
In #next_batch:
cursor = cursor_values.last
return unless cursor.present?
# The primary key was plucked, but original cursor did not include it, so we should remove it
cursor.pop unless @primary_key_index
@cursor = Array.wrap(cursor)
If the last batch is empty, we'll return early here, so @cursor will be nil.
That is not clear at all. You are relying on hoisting to define the value of a variable, and the reader has to remember that this is how the interpreter works. In that case it would be better to do:
cursor = cursor_values.last
if cursor.present?
  # The primary key was plucked, but original cursor did not include it, so we should remove it
  cursor.pop unless @primary_key_index
  @cursor = Array.wrap(cursor)
else
  @cursor = nil
end
Revisited the code and realized that @cursor actually can't be nil 🤦♀️ The value will carry over from the previous batch, even if we return early. The check is unnecessary, so we'll put something up to remove it.
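The carry-over behaviour described in this thread can be illustrated with a small plain-Ruby sketch. It is an assumption-laden stand-in rather than the PR's code: `CursorSketch` and `pages` are hypothetical, `nil?` replaces ActiveSupport's `present?`, and `Array()` replaces `Array.wrap`.

```ruby
# Hypothetical sketch; `pages` stands in for the successive results of
# plucking the cursor columns. None of these names come from the PR.
class CursorSketch
  attr_reader :cursor

  def initialize(pages)
    @pages = pages
    @cursor = nil
  end

  def next_batch
    cursor_values = @pages.shift || []
    cursor = cursor_values.last
    # Early return: the assignment below is skipped, so @cursor keeps
    # whatever the previous call stored.
    return if cursor.nil?

    @cursor = Array(cursor)
    cursor_values
  end
end

sketch = CursorSketch.new([[1, 2, 3], []])
sketch.next_batch # => [1, 2, 3]
sketch.cursor     # => [3]
sketch.next_batch # => nil (empty batch, early return)
sketch.cursor     # => [3], carried over from the previous batch
```

Under these assumptions the loop already stops because `next_batch` returns `nil`, so a separate `@cursor.nil?` check adds nothing once a first batch has been read.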
Take two of #86
Context
#active_record_on_batches is an array of records
#update_all or #delete_all) and convert this to a relation: Model.where(id: records.map(&:id)).update_all. This produces an extra query
Proposed Solution
#active_record_on_batch_relations, leaving existing batch enumerator API for records intact
ActiveRecordBatchEnumerator. The existing ActiveRecordEnumerator enumerator and ActiveRecordCursor classes have a lot of duplication and should probably be refactored. I intend to go back and abstract away cursor-related details from this new ActiveRecordBatchEnumerator class, while also fixing up the existing ActiveRecordCursor, in a separate PR.
Enumerator and defines #each
record = relation.last and then querying the cursor columns on the record to construct the cursor, we pluck only the cursor columns.
relation.last will load all of the records because we are using a LIMIT, as described here. Consequently, we optimize by only plucking what we need.
#each_iteration than reusing the original relation (which may have had complex logic / joins / etc). This was actually inspired by how Rails does #in_batches
ActiveRecordEnumerator and ActiveRecordCursor, with some simplifications, given that now everything is happening within a single object.
End Result
SELECT <cursor_columns> FROM relation LIMIT <batch_size>
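To make the flow concrete, here is a rough in-memory sketch of the approach described under Proposed Solution: pluck only the cursor column for one batch, remember the last value as the cursor, then yield a batch scoped to the plucked values together with that cursor. Everything here is hypothetical (`Row`, `InMemoryBatchSketch`, a single `id` cursor column, an array in place of a relation), and scoping the yielded batch by plucked ids is an assumption modelled on how Rails' `#in_batches` yields `where(primary_key => ids)`.

```ruby
# Hypothetical in-memory illustration; not the PR's ActiveRecordBatchEnumerator.
Row = Struct.new(:id, :name)

class InMemoryBatchSketch
  def initialize(rows, batch_size:, cursor: nil)
    @rows = rows.sort_by(&:id)
    @batch_size = batch_size
    @cursor = cursor
  end

  # Block form only, to keep the sketch short.
  def each
    while (batch = next_batch)
      yield batch, @cursor
    end
  end

  private

  def next_batch
    # Stand-in for: SELECT <cursor_columns> FROM relation LIMIT <batch_size>
    # (the cursor condition is assumed to be part of the relation).
    ids = @rows.map(&:id)
               .select { |id| @cursor.nil? || id > @cursor }
               .first(@batch_size)
    return if ids.empty?

    @cursor = ids.last
    # Stand-in for a fresh relation scoped to the plucked ids.
    @rows.select { |row| ids.include?(row.id) }
  end
end

rows = (1..5).map { |i| Row.new(i, "row#{i}") }
InMemoryBatchSketch.new(rows, batch_size: 2).each do |batch, cursor|
  puts "ids=#{batch.map(&:id).inspect} cursor=#{cursor}"
end
```

Passing `cursor: 4` to the constructor would resume after id 4, which is the role the cursor plays when a job is interrupted and restarted.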