feat: implement parallel operations#2067
feat: implement parallel operations#2067ddelgrosso1 merged 25 commits intogoogleapis:transfer-managerfrom
Conversation
|
Opening as a draft PR while I add in JSDoc comments so that I can start gathering feedback / initial thoughts. |
|
Repointing this at a new branch named |
danielduhh
left a comment
There was a problem hiding this comment.
Just a couple passthrough comments/questions :)
3e26dd8 to
ad6aadd
Compare
ad6aadd to
1cc8bae
Compare
* fix: more work on transfer manager perf metrics * fix unit tests for tm
* refactor: refactor constants * bug fixes * bug fixes * bug fixes
81f7850 to
c153ab6
Compare
shaffeeullah
left a comment
There was a problem hiding this comment.
Few comments, but generally looks good. Thanks, Denis!!
| * ``` | ||
| * @experimental | ||
| */ | ||
| async downloadManyFiles( |
There was a problem hiding this comment.
do other lanauges have this as a list of objects? or is a list of strings also an option?
There was a problem hiding this comment.
Python does it as a list of names. If desired we could move to that as well. Thoughts?
There was a problem hiding this comment.
The two use cases are
- download "directory"
- download list of files
Not sure which is more common (question for @domZippilli) but it would be great to support both (or show users how to do both)
There was a problem hiding this comment.
I've modified the signature to now accept an array of files, an array of strings, or a single string. If a single string is provided it will be treated as a GCS prefix to call getFiles. Utilizing this prefix will allow for an entire folder structure to be downloaded.
There was a problem hiding this comment.
That's probably the right idea. It sounds like an idiomatic variation on what I wrote before I saw @ddelgrosso1's reply.
A directory is probably the most common set of files people upload. But of course you'd want to think about this as a list of files to be extensible. To me, the logical progression is something like "I want to upload a directory" -> "I want to form a glob of file names" -> "I want to upload a list of files". So solving the last problem lets you frame the others as variations on it. For example, make a convenience method that accepts a directory name, and just constructs the list of files to pipe into the actual implementation.
| assert.strictEqual(count, paths.length); | ||
| }); | ||
|
|
||
| it('sets ifGenerationMatch to 0 if skipIfExists is set', async () => { |
There was a problem hiding this comment.
is there a test for skipIfExists set to true, the file exists, and it errors? We should verify that it doesnt stop the execution of the code and just moves onto the next file.
There was a problem hiding this comment.
I'm not sure how to best handle this as during a unit test since we aren't calling the actual service.
| * ``` | ||
| * @experimental | ||
| */ | ||
| async downloadManyFiles( |
There was a problem hiding this comment.
The two use cases are
- download "directory"
- download list of files
Not sure which is more common (question for @domZippilli) but it would be great to support both (or show users how to do both)
|
Here is the summary of changes. You are about to add 3 region tags.
This comment is generated by snippet-bot.
|
danielbankhead
left a comment
There was a problem hiding this comment.
Looks good, a few thoughts/comments. Gonna play with it some more tomorrow.
| * limitations under the License. | ||
| */ | ||
|
|
||
| // eslint-disable-next-line node/no-unsupported-features/node-builtins |
There was a problem hiding this comment.
Is this comment still needed?
There was a problem hiding this comment.
Yes, this file requires worker_theads and not all features were available in the initial release of v12. If I remove this the CI pipeline starts to complain if I remember correctly.
| * @experimental | ||
| */ | ||
| async uploadManyFiles( | ||
| filePaths: string[], |
There was a problem hiding this comment.
[Optional] We have an opportunity to get fancy here to really emphasize throughput; what if we allowed iterators and async iterators?
Use cases:
- walking large directory you want to walk asynchronously (reduces memory usage)
- situations where one may parse a list of files in chunks and want to chain operations together (such as an HTTP server receiving incoming data)
There was a problem hiding this comment.
We can probably wait until customers ask for this functionality though.
There was a problem hiding this comment.
It is a good idea, maybe after the initial release we can circle up. I would definitely be interested in your thoughts on optimizations here.
| async downloadFileInChunks( | ||
| fileOrName: File | string, | ||
| options: DownloadFileInChunksOptions = {} | ||
| ): Promise<void | DownloadResponse> { |
There was a problem hiding this comment.
What are your thoughts on data validation with this method? Since file.download({start, end}) skips validation, we may want to add support or mention it in the description.
There was a problem hiding this comment.
Good question... I will definitely add a callout in the comments that this will skip validation. Do we currently have a means of validating a file after it has been downloaded? I know you did a bunch of work in this area but I thought it only worked with streamed input?
There was a problem hiding this comment.
I believe it's technically possible to CRC32C chunks and then "sum" them into the final checksum, such that you could do this kind of chunked operation and still checksum without an additional byte scan. If you need MD5 or can't do the previous algorithm, I think validation would require a full read of the completed file.
There was a problem hiding this comment.
Do we currently have a means of validating a file after it has been downloaded?
Yep (although we may be able to do this as we're receiving each chunk):
Line 257 in a61e619
I know you did a bunch of work in this area but I thought it only worked with streamed input?
It's possible to use CRC32C with/out streams (HashStreamValidator is the class that wraps CRC32C for streams):
Line 134 in a61e619
nodejs-storage/src/hash-stream-validator.ts
Lines 30 to 34 in a61e619
There was a problem hiding this comment.
Are we comfortable with only allowing CRC32C. In my mind something is better than nothing... I can take a look at implementing it if we feel strongly this should be included.
| const fileToWrite = await fsp.open(filePath, 'w+'); | ||
| while (start < size) { | ||
| const chunkStart = start; | ||
| let chunkEnd = start + chunkSize - 1; | ||
| chunkEnd = chunkEnd > size ? size : chunkEnd; | ||
| promises.push( | ||
| limit(() => | ||
| file.download({start: chunkStart, end: chunkEnd}).then(resp => { | ||
| return fileToWrite.write(resp[0], 0, resp[0].length, chunkStart); | ||
| }) | ||
| ) | ||
| ); | ||
|
|
||
| start += chunkSize; | ||
| } |
There was a problem hiding this comment.
What's stopping the chunks from being written out of order?
There was a problem hiding this comment.
If I did this correctly the position in the file of where to write this chunk is handled by the chunkStart parameter. Basically this call should write the incoming chunk with no offset in the chunk buffer to the file offset specified by chunkStart. Was your understanding different?
There was a problem hiding this comment.
I have a slightly different understanding here (I could be wrong), but would it be possible for subsequent chunks to download before proceeding chunks and write to the file first?
Say we have a 65MB file and a 64MB chunk size (making 2 chunks to download). Since the 2nd chunk is 1MB and would download first, wouldn't it write to the file before the first chunk?
There was a problem hiding this comment.
It shouldn't matter if they are written in order. This implementation is leaning on a filesystem supporting sparse files. So, imagine you wrote these back-to-front; the first chunk will seek to its chunkStart near the end of the file, and the OS will logically pad that first byte with 0s up to the seek point.
There was a problem hiding this comment.
^that's the missing piece for me, thanks Dom!
There was a problem hiding this comment.
As Dom outlined this was the behavior I was intending. It shouldn't matter which chunk downloads first they should get written to their correct place in the resultant file. Ultimately if all succeed, the file should be correctly "reassembled".
|
Since this is getting large and most feedback has been addressed going to go ahead and merge this into the staging branch. We can re-review before the final merge to main. |
* feat: implement parallel operations * add more parallel operations * add header to test file * update import of fs/promises * fix pathing on windows, fix mocking of fs promises * add jsdoc headers to class and uploadMulti * add jsdoc comments to remaining functions * update comment wording * add experimental jsdoc tags * feat: add directory generator to performance test framework * clarify variable names and comments * capitalization * wip: transfer manager performance tests * feat: merged in application performance tests (googleapis#2100) * fix: fixed many bugs (googleapis#2102) * fix: cleaning up bugs * fix: fixed many bugs * fix: more work on transfer manager perf metrics (googleapis#2103) * fix: more work on transfer manager perf metrics * fix unit tests for tm * fix: performance test refactoring, comments (googleapis#2104) * refactor: refactor constants (googleapis#2105) * refactor: refactor constants * bug fixes * bug fixes * bug fixes * linter fixes, download to disk for performance test * rename transfer manager functions * remove callbacks from transfer manager * add more experimental tags, update comments * change signature of downloadManyFiles to accept array of strings or a folder name * linter fix * add transfer manager samples and samples tests Co-authored-by: Sameena Shaffeeullah <shaffeeullah@google.com>
* feat: implement parallel operations * add more parallel operations * add header to test file * update import of fs/promises * fix pathing on windows, fix mocking of fs promises * add jsdoc headers to class and uploadMulti * add jsdoc comments to remaining functions * update comment wording * add experimental jsdoc tags * feat: add directory generator to performance test framework * clarify variable names and comments * capitalization * wip: transfer manager performance tests * feat: merged in application performance tests (googleapis#2100) * fix: fixed many bugs (googleapis#2102) * fix: cleaning up bugs * fix: fixed many bugs * fix: more work on transfer manager perf metrics (googleapis#2103) * fix: more work on transfer manager perf metrics * fix unit tests for tm * fix: performance test refactoring, comments (googleapis#2104) * refactor: refactor constants (googleapis#2105) * refactor: refactor constants * bug fixes * bug fixes * bug fixes * linter fixes, download to disk for performance test * rename transfer manager functions * remove callbacks from transfer manager * add more experimental tags, update comments * change signature of downloadManyFiles to accept array of strings or a folder name * linter fix * add transfer manager samples and samples tests Co-authored-by: Sameena Shaffeeullah <shaffeeullah@google.com>
* feat: implement parallel operations * add more parallel operations * add header to test file * update import of fs/promises * fix pathing on windows, fix mocking of fs promises * add jsdoc headers to class and uploadMulti * add jsdoc comments to remaining functions * update comment wording * add experimental jsdoc tags * feat: add directory generator to performance test framework * clarify variable names and comments * capitalization * wip: transfer manager performance tests * feat: merged in application performance tests (#2100) * fix: fixed many bugs (#2102) * fix: cleaning up bugs * fix: fixed many bugs * fix: more work on transfer manager perf metrics (#2103) * fix: more work on transfer manager perf metrics * fix unit tests for tm * fix: performance test refactoring, comments (#2104) * refactor: refactor constants (#2105) * refactor: refactor constants * bug fixes * bug fixes * bug fixes * linter fixes, download to disk for performance test * rename transfer manager functions * remove callbacks from transfer manager * add more experimental tags, update comments * change signature of downloadManyFiles to accept array of strings or a folder name * linter fix * add transfer manager samples and samples tests Co-authored-by: Sameena Shaffeeullah <shaffeeullah@google.com>
* feat: implement parallel operations (#2067) * feat: implement parallel operations * add more parallel operations * add header to test file * update import of fs/promises * fix pathing on windows, fix mocking of fs promises * add jsdoc headers to class and uploadMulti * add jsdoc comments to remaining functions * update comment wording * add experimental jsdoc tags * feat: add directory generator to performance test framework * clarify variable names and comments * capitalization * wip: transfer manager performance tests * feat: merged in application performance tests (#2100) * fix: fixed many bugs (#2102) * fix: cleaning up bugs * fix: fixed many bugs * fix: more work on transfer manager perf metrics (#2103) * fix: more work on transfer manager perf metrics * fix unit tests for tm * fix: performance test refactoring, comments (#2104) * refactor: refactor constants (#2105) * refactor: refactor constants * bug fixes * bug fixes * bug fixes * linter fixes, download to disk for performance test * rename transfer manager functions * remove callbacks from transfer manager * add more experimental tags, update comments * change signature of downloadManyFiles to accept array of strings or a folder name * linter fix * add transfer manager samples and samples tests Co-authored-by: Sameena Shaffeeullah <shaffeeullah@google.com> * 🦉 Updates from OwlBot post-processor See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md * feat: add crc32c validation option to chunked download, cleanup naming (#2110) * feat: add crc32c validation option to chunked download, cleanup naming in perf tests * close file in finally block Co-authored-by: Sameena Shaffeeullah <shaffeeullah@google.com> Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Fixes #1868 🦕
Fixes #2051