Fix off-by-one-error in CsvEnumerator #52
d539d90 to
491f65b
Is there any way the off-by-one nature of this is related to the fact that CSV files often have a first row containing headers rather than data? In that case, the fix might need to be different than just mimicking what the array enumerator does. Was this tested with both CSV files that contain a header row and CSV files that don't?
I believe the behaviour around the header row is dictated by the CSV instance that's passed to the enumerator constructor. There seem to be existing tests that assert the behaviour both when headers are parsed and when they're not parsed:
job-iteration/test/unit/csv_enumerator_test.rb Lines 44 to 48 in c050da9
job-iteration/test/unit/csv_enumerator_test.rb Lines 60 to 63 in c050da9
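To illustrate the point about headers, here is a small sketch that uses only Ruby's standard csv library; the sample data is made up and the job-iteration gem is not involved. It suggests that with headers: true the parser consumes the header line, so index 0 already refers to the first data row, whereas without that option the header line is yielded as an ordinary row at index 0.

```ruby
require "csv"

# Made-up sample data: one header line followed by two data rows.
data = "name,qty\napple,1\npear,2\n"

# With headers: true the parser consumes the first line as headers,
# so the enumeration only yields data rows (as CSV::Row objects).
with_headers = CSV.new(data, headers: true).each_with_index.to_a
first_with_headers = with_headers.first # [CSV::Row, index]

# Without the option, the first line is yielded like any other row.
without_headers = CSV.new(data).each_with_index.to_a
first_without_headers = without_headers.first # [Array, index]

puts first_with_headers[0].to_h.inspect
puts first_without_headers.inspect
```

In both cases the index paired with each row is what a cursor would record, so the off-by-one on resume is independent of whether headers are parsed.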
alanly
left a comment
The cursor behaviour introduced here seems to be in line with the documented example where it's passed as a :starting_after argument.
I don't have as much Job Iteration context as the other folks pinged on this, so I don't know if the change is worthy of additional tests (beyond what's been updated here), or if such a change would require additional consideration when generating a new release.
@@ -30,19 +30,23 @@ def initialize(csv)
# Constructs a enumerator on CSV rows
# @return [Enumerator] Enumerator instance
def rows(cursor:)
Is it worth documenting what cursor means for the iteration on resume (specifically, that it represents the last successfully processed row)?
Yeah it would make sense
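One possible shape for that documentation, shown as a minimal sketch rather than the gem's actual code: SketchCsvEnumerator is a hypothetical stand-in that wraps a plain array, and the doc comment wording is only a suggestion.

```ruby
# Hypothetical stand-in for the enumerator under review; not the gem's code.
class SketchCsvEnumerator
  def initialize(rows)
    @rows = rows
  end

  # Constructs an enumerator on rows.
  #
  # @param cursor [Integer, nil] index of the last successfully processed
  #   row, or nil when starting from the beginning. On resume, iteration
  #   continues with the row that follows this index.
  # @return [Enumerator] yields [row, index] pairs
  def rows(cursor:)
    # drop takes a count of elements, so a cursor of N means N + 1 rows are done.
    processed = cursor.nil? ? 0 : cursor + 1
    @rows.each_with_index.drop(processed).to_enum
  end
end

enumerator = SketchCsvEnumerator.new([["a"], ["bb"], ["c"]])
resumed = enumerator.rows(cursor: 0).to_a
fresh = enumerator.rows(cursor: nil).to_a
```

Here resumed starts at the row with index 1, and fresh starts at index 0.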
I tested locally inside an
csv = [['a', 'a', 'a'], ['bb', 'bb', 'bb'], ['c']]
# Let's start the iteration, cursor is nil
# nil.to_i gets converted to 0
enum = csv.each_with_index.drop(0).to_enum
=> #<Enumerator: [[["a", "a", "a"], 0], [["bb", "bb", "bb"], 1], [["c"], 2]]:each>
# We iterate and the job gets halted after processing element 1
# So now cursor is 1
# If we fetch 1 as it is it will return the previously processed element
enum = csv.each_with_index.drop(1).to_enum
=> #<Enumerator: [[["bb", "bb", "bb"], 1], [["c"], 2]]:each>
# But if we apply the change from the PR returns the right elements
enum = csv.each_with_index.drop(2).to_enum
=> #<Enumerator: [[["c"], 2]]:each>
Great work 🎉
c9e8aa3 to
cfdd449
I'm addressing an off-by-one error in the iteration of the CsvEnumerator. When the job resumed after an interruption, it would begin by re-processing the record that had just been successfully imported prior to the interruption. That is, the job would re-process the last iteration instead of moving on to the next iteration as expected. This problem happens because the CsvEnumerator passes the given cursor position directly into #drop without incrementing it first. #drop expects the number of elements to remove from an array, not the index of the element we want to drop. Since we just pass in the index of the last successful iteration and #drop isn't positional, we end up removing all the elements before the last iteration, but not the last iteration itself.
- I handle the cursor in the same way as the #build_array_enumerator. I increment the cursor by 1 before passing it to #drop, unless the cursor is nil, in which case 0 is passed instead.
cfdd449 to
d89364d
What are you trying to accomplish?
I'm addressing an off-by-one error in the iteration of the CsvEnumerator. I noticed the bug when I was using JobIteration::CsvEnumerator.new(csv).rows(cursor: cursor) in a job that imports records. When the job resumed after an interruption, it would begin by re-processing the record that had just been successfully imported prior to the interruption. That is, the job would re-process the last iteration instead of moving on to the next iteration as expected. This problem happens because the CsvEnumerator passes the given cursor position directly into #drop without incrementing it first. #drop expects the number of elements to remove from an array, not the index of the element we want to drop. Since we just pass in the index of the last successful iteration and #drop isn't positional, we end up removing all the elements before the last iteration, but not the last iteration itself.
I noticed the batches version of the CsvEnumerator had the same problem, so I went ahead and updated that method and its tests too.
A quick shoutout to @alanly for working with me on this and spotting the error in the code 🙏
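The same reasoning applies to batches, where the cursor is a batch index rather than a row index. The following is a rough sketch on a plain array (the data and batch size are made up, and it does not use the gem's batches method) showing why dropping cursor batches replays the last completed batch, while dropping cursor + 1 does not.

```ruby
# Made-up rows, grouped into batches of two.
rows = [["a"], ["bb"], ["c"], ["d"], ["e"]]
batch_size = 2

# Each element is [batch, batch_index].
batches = rows.each_slice(batch_size).each_with_index.to_a

# Suppose batch 0 completed before the job was interrupted.
cursor = 0

# Passing the cursor straight to drop removes nothing here,
# so batch 0 would be processed a second time.
replayed = batches.drop(cursor)

# Dropping cursor + 1 batches resumes at the next unprocessed batch.
resumed = batches.drop(cursor + 1)

puts replayed.first.inspect
puts resumed.first.inspect
```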
What approach did you choose and why?
- I handle the cursor in the same way as the #build_array_enumerator. I increment the cursor by 1 before passing it to #drop, unless the cursor is nil, in which case 0 is passed instead.
- I removed the "considers cursor: nil as the start" tests because they're now redundant. We implicitly test that the enumerator properly handles nil in the "yields every record/batch with their cursor position" tests.
What should reviewers focus on?
Are there any other tests or files that should be updated?
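One kind of check that might be worth considering is an interrupt-and-resume round trip. The sketch below simulates it with plain Ruby; rows_from is a hypothetical helper that mirrors the cursor handling described in this PR, and it stands in for the real enumerator rather than calling it.

```ruby
# Hypothetical helper mirroring the described approach: nil means nothing
# has been processed yet; otherwise skip everything up to and including cursor.
def rows_from(rows, cursor)
  processed = cursor.nil? ? 0 : cursor + 1
  rows.each_with_index.drop(processed)
end

rows = [["a"], ["bb"], ["c"]]
seen = []
cursor = nil

# First run: simulate an interruption right after one row is processed.
rows_from(rows, cursor).each do |row, index|
  seen << row
  cursor = index
  break
end

# Second run: resume from the stored cursor and finish the remaining rows.
rows_from(rows, cursor).each do |row, index|
  seen << row
  cursor = index
end

puts seen.inspect
```

If the cursor were passed to drop unchanged, the first row would appear twice in seen.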