Add a flagging callback to save json files to a hugging face dataset by chrisemezue · Pull Request #1821 · gradio-app/gradio

chrisemezue · 2022-07-18T22:22:02Z

Description

Based on issue #1676 I have created the HuggingFaceDatasetJSONSaver class which saves the files as JSONL.

Specifically, for each flagged sample:

Create a unique ID (a hash of random numbers and strings) and create a folder named after that ID. In the code, this folder is called folder_name.
Save the files (images, audio) inside folder_name.
Save the other details (output, numbers, etc.) in a metadata.jsonl file inside folder_name.
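
A minimal sketch of the three steps above, using only the standard library. It is not the code in this PR: save_flagged_sample, its parameters, and the use of uuid4 in place of the hash-based ID are assumptions made for illustration.

```python
import json
import tempfile
import uuid
from pathlib import Path


def save_flagged_sample(dataset_dir, files, details):
    """Write one flagged sample into its own uniquely named folder.

    Hypothetical helper for illustration; not the implementation in this PR.
    """
    # Assumption: uuid4 stands in for the "hash of random numbers and strings".
    folder_name = Path(dataset_dir) / uuid.uuid4().hex
    folder_name.mkdir(parents=True)

    # Step 2: save the binary files (images, audio) inside the sample folder.
    for file_name, payload in files.items():
        (folder_name / file_name).write_bytes(payload)

    # Step 3: save the remaining details as one JSON line in metadata.jsonl.
    with open(folder_name / "metadata.jsonl", "w", encoding="utf-8") as f:
        f.write(json.dumps(details) + "\n")
    return folder_name


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        folder = save_flagged_sample(
            tmp,
            files={"audio.wav": b"\x00\x01"},
            details={"output": "hello", "flag": "incorrect"},
        )
        # Lists the file names written for this sample.
        print(sorted(p.name for p in folder.iterdir()))
```

Each sample is self-contained in this layout, so nothing written for one sample touches the files of another.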

Advantages of this:

The major advantage is that we bypass the need to read and write to a single CSV file. This matters when, for example, three users on different devices flag samples at the same time. With one shared CSV there would be an error, because the file cannot take more than one simultaneous edit. Since each flagged sample gets its own folder, the approach I propose enables parallel flagging.
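
To make the point concrete, here is a small sketch in which several threads flag samples at once, each writing to its own folder. The names flag_sample and flag_in_parallel are hypothetical, and the sketch only covers writes to the local file system; it makes no claim about concurrent pushes to the Hub.

```python
import json
import tempfile
import uuid
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def flag_sample(dataset_dir, details):
    """Hypothetical helper: write one sample's details to a fresh folder."""
    # Every call gets a new folder, so no two writers share a file.
    folder = Path(dataset_dir) / uuid.uuid4().hex
    folder.mkdir(parents=True)
    (folder / "metadata.jsonl").write_text(
        json.dumps(details) + "\n", encoding="utf-8"
    )
    return folder


def flag_in_parallel(dataset_dir, samples, max_workers=3):
    """Flag several samples concurrently and return the folders created."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda s: flag_sample(dataset_dir, s), samples))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        samples = [{"user": i, "flag": "incorrect"} for i in range(3)]
        folders = flag_in_parallel(tmp, samples)
        # One folder per flagged sample.
        print(len(folders))
```

Because no file is shared between writers, no locking is needed for the local writes.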

Any additional dependencies that are required for this change:

No additional dependencies are required. I tried my best to leverage functions from the original HuggingFaceDatasetSaver.

Closes: #1676

Checklist:

[x] I have performed a self-review of my own code
My code follows the style guidelines of this project
I have commented my code in hard-to-understand areas
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

abidlabs · 2022-07-19T03:34:18Z

Hi @chrisemezue,

This is very cool, two quick high-level questions:

Can you run the formatter on your code? That way, the CI won't complain about the formatting. The easiest way is to run this script:

bash scripts/format_backend.sh

Just to confirm, is this way of saving data into a HuggingFace Dataset (having folders for each sample) compatible with the Dataset previewer on the Hub?

osanseviero · 2022-07-19T11:16:26Z

I think the title of this PR is a bit misleading. HuggingFaceDatasetSaver will still not work in parallel afaik. For me it's a bit confusing if this is saving a data.csv or a json based on the name (HuggingFaceDatasetJSONSaver). I see log_file uses csv but that does not seem to be used anywhere. If it's a csv, should we directly update HuggingFaceDatasetJSONSaver?

chrisemezue · 2022-07-19T17:35:24Z

> I think the title of this PR is a bit misleading. HuggingFaceDatasetSaver will still not work in parallel afaik. For me it's a bit confusing if this is saving a data.csv or a json based on the name (HuggingFaceDatasetJSONSaver). I see log_file uses csv but that does not seem to be used anywhere. If it's a csv, should we directly update HuggingFaceDatasetJSONSaver?

Thanks @osanseviero for your feedback

I will remove self.log_file and the other redundant variables that are not used.
The class HuggingFaceDatasetJSONSaver just saves the flagged samples in JSONL format instead of CSV. Is there a better name you would suggest?
I changed the name of the PR to the name of its issue. If you have a better, easier-to-understand suggestion, I would love to hear it.

osanseviero · 2022-07-19T20:14:04Z

Hey @chrisemezue, thank you! I was a bit confused by the mentions of csvs, but now that you mention it's a json then it's great! Thanks!

osanseviero · 2022-07-21T10:39:34Z

Please let us know whenever this is ready for review :)

chrisemezue · 2022-07-25T20:34:46Z

@osanseviero I am done now. Ready for review.

osanseviero

Thanks a lot for this! This is very cool! I left some minor comments 🤗

I would love to see an example output of using this flagging callback (a small dataset, since https://huggingface.co/datasets/chrisjay/crowd-speech-africa has too many files and it does not load :()).

cc @lhoestq

chrisemezue · 2022-08-08T19:15:09Z

@osanseviero here is an example of a small dataset with this flagging callback.

abidlabs · 2022-08-11T05:57:08Z

Hi @chrisemezue this looks really good! I left some suggestions / clarification questions in the PR, but once these are addressed, we should be good to merge

abidlabs · 2022-08-11T18:12:54Z

Pushed some changes which should fix the tests. As discussed over Slack, we just have a couple of minor fixes, and then we should be good to merge!

abidlabs · 2022-08-12T03:07:32Z

Thanks so much @chrisemezue for making the PR and addressing the suggestions! And thanks all for reviewing.

LGTM -- will merge in after the tests run

osanseviero

This looks good! Thanks for working on this!

osanseviero · 2022-08-22T10:03:25Z

+
+        for component in components:
+            headers.append(component.label)
+            headers.append(component.label)


This is repeated above, is that intended?

abidlabs · 2022-08-23T22:42:49Z

I'll resolve the conflicts and fix the last few suggestions you made @osanseviero, so that we can get this merged in.

Thanks a bunch @chrisemezue!

work on saving flags in JSON format

621acee

chrisemezue changed the title ~~work on saving flags in JSON format~~ enable saving flags in parallel Jul 18, 2022

explained what I did more clearly

1dcc0ad

chrisemezue changed the title ~~enable saving flags in parallel~~ Add a flagging callback to save json files to a hugging face dataset Jul 19, 2022

osanseviero self-requested a review July 19, 2022 20:14

final updates + added test case

abf5bdc

osanseviero reviewed Jul 27, 2022

lhoestq reviewed Jul 27, 2022
