Skip to content

Fix off-by-one-error in CsvEnumerator#52

Merged
stellarxo merged 2 commits into
masterfrom
fix-off-by-one-error-in-csv_enumerator
May 21, 2020
Merged

Fix off-by-one-error in CsvEnumerator#52
stellarxo merged 2 commits into
masterfrom
fix-off-by-one-error-in-csv_enumerator

Conversation

@sunyoung-shopify

@sunyoung-shopify sunyoung-shopify commented May 13, 2020

Copy link
Copy Markdown
Contributor

What are you trying to accomplish?

I'm addressing an off-by-one-error in the iteration of the CsvEnumerator. I noticed the bug when I was using JobIteration::CsvEnumerator.new(csv).rows(cursor: cursor) in a job that imports records. When the job would resume after interruption, it would begin by re-processing the record that was just sucessfully imported prior to interruption. That is, the job would re-process the last iteration instead of the moving onto the next iteration as expected. This problem is happening because the CsvEnumerator passes the given cursor position directly into #drop without incrementing it first. #drop expects the number of elements to remove from an array, not the index of the element we want to drop. Since we just pass in the index of the last sucessful iteration and #drop isn't positional, we end up removing all the elements before the last iteration, but not the last iteration itself.

I noticed the batches version of the CsvEnumerator had the same problem, so I went ahead and updated that method and its tests too.

A quick shoutout to @alanly for working with me on this and spotting the error in the code 🙏

What approach did you choose and why?

  • I handle the cursor in the same way as the #build_array_enumerator. I increment the cursor by 1 before passing it to #drop, unless the cursor is nil, in which case 0 is passed instead
  • I updated the existing tests to now work with the fix
  • I removed the considers cursor: nil as the start tests because they're now redundant. We implicitly test that the enumerator properly handles nil in the yields every record/batch with their cursor position tests

What should reviewers focus on?

Are there any other tests or files that should be updated?

@sunyoung-shopify sunyoung-shopify force-pushed the fix-off-by-one-error-in-csv_enumerator branch 2 times, most recently from d539d90 to 491f65b Compare May 13, 2020 12:56
@sunyoung-shopify sunyoung-shopify marked this pull request as ready for review May 13, 2020 15:26
@djmortonShopify

Copy link
Copy Markdown
Contributor

Is there any way the off-by-one nature of this is related to the fact that often CSV files will have a first row containing headers, rather than data? In that way, it might be different that just mimicking what the array enumerator does. Was this tested with both CSV files that contain a header row and CSV files that don't?

@alanly

alanly commented May 20, 2020

Copy link
Copy Markdown
Member

Is there any way the off-by-one nature of this is related to the fact that often CSV files will have a first row containing headers, rather than data?

I believe the behaviour around the header row is dictated by the CSV instance that's passed to the enumerator constructor.

There seems to be existing tests that asserts the behaviour when headers are parsed and when they're not parsed:

test "#rows enumerator returns size excluding headers" do
enum = build_enumerator(open_csv)
.rows(cursor: 0)
assert_equal 11, enum.size
end

test "#rows enumerator returns size including headers" do
enum = build_enumerator(open_csv(headers: false)).rows(cursor: 10)
assert_equal 12, enum.size
end

@alanly alanly left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cursor behaviour introduced here seems to be in line with the documented example where it's passed as a :starting_after argument.

I don't have as much Job Iteration context as the other folks pinged on this, so I don't know if the change is worthy of additional tests (beyond what's been updated here,) or if such a change would require additional consideration when generating a new release.

@@ -30,19 +30,23 @@ def initialize(csv)
# Constructs a enumerator on CSV rows
# @return [Enumerator] Enumerator instance
def rows(cursor:)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth documenting what cursor means for the iteration on resume (specifically that it represents the last successfully processed row?)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it would make sense

@GustavoCaso

Copy link
Copy Markdown
Contributor

I tested locally inside an irb session, and I think it makes total sense the change 😄

csv = [['a', 'a', 'a'], ['bb', 'bb', 'bb'], ['c']]

# Lets start the iteration, cursor is nil
# nil.to_i gets converted to 0

enum = csv.each_with_index.drop(0).to_enum
=> #<Enumerator: [[["a", "a", "a"], 0], [["bb", "bb", "bb"], 1], [["c"], 2]]:each>

# We iterate and the job gets halted after processing element 1
# So now cursor is 1

# If we fetch 1 as it is it will return the previously processed element 
enum = csv.each_with_index.drop(1).to_enum
=> #<Enumerator: [[["bb", "bb", "bb"], 1], [["c"], 2]]:each>

# But if we apply the change from the PR returns the right elements 
enum = csv_dup.each_with_index.drop(2).to_enum
=> #<Enumerator: [[["c"], 2]]:each>

Great work 🎉

Comment thread lib/job-iteration/csv_enumerator.rb Outdated
@stellarxo stellarxo force-pushed the fix-off-by-one-error-in-csv_enumerator branch 2 times, most recently from c9e8aa3 to cfdd449 Compare May 21, 2020 15:50
@stellarxo stellarxo requested a review from GustavoCaso May 21, 2020 15:51
I'm addressing an off-by-one-error in the iteration of the CsvEnumerator.

 When the job would resume after interruption,
it would begin by re-processing the record that was just sucessfully imported prior to interruption.
That is, the job would re-process the last iteration
instead of the moving onto the next iteration as expected.

This problem is happening because the CsvEnumerator passes the given cursor position
directly into #drop without incrementing it first.
not the index of the element we want to drop.
Since we just pass in the index of the last sucessful iteration and #drop isn't positional,
we end up removing all the elements before the last iteration,
but not the last iteration itself.

- I handle the cursor in the same way as the #build_array_enumerator.
  I increment the cursor by 1 before passing it to #drop,
  unless the cursor is nil, in which case 0 is passed instead
@stellarxo stellarxo force-pushed the fix-off-by-one-error-in-csv_enumerator branch from cfdd449 to d89364d Compare May 21, 2020 17:05
@stellarxo stellarxo merged commit 76dc55c into master May 21, 2020
@GustavoCaso GustavoCaso temporarily deployed to rubygems May 22, 2020 12:54 Inactive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants