Skip to content

Add a flagging callback to save json files to a hugging face dataset#1821

Merged
abidlabs merged 29 commits into
gradio-app:mainfrom
chrisemezue:main
Aug 23, 2022
Merged

Add a flagging callback to save json files to a hugging face dataset#1821
abidlabs merged 29 commits into
gradio-app:mainfrom
chrisemezue:main

Conversation

@chrisemezue
Copy link
Copy Markdown
Contributor

@chrisemezue chrisemezue commented Jul 18, 2022

Description

Based on issue #1676 I have created the HuggingFaceDatasetJSONSaver class which saves the files as JSONL.

Specifically, for each flagged sample:

  • I create a unique ID (a hash of random numbers and strings) and create a folder with the name of the ID. In the code I call the new folder folder_name.
  • Save the files (images, audio) inside folder_name
  • Save the other details (output, numbers, etc) in a metadata.jsonl file inside the folder_name folder.

Advantages of this:

  • The major advantage is that we bypass the need to read and write to one CSV. Where the advantage of this is useful is if there are three users on their devices simultaneously flagging a sample. With CSV there would be an error because there can't be more than one simultaneous edit to a CSV file. But this way I propose enables parallel flagging.
  • any additional dependencies that are required for this change.
  • no additional dependencies are required. I tried my best to make sure I leverage functions from the original HuggingFaceDatasetSaver.

Closes: #1676

Checklist:

  • [ X] I have performed a self-review of my own code
  • My code follows the style guidelines of this project
  • I have commented my code in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@chrisemezue chrisemezue changed the title work on saving flags in JSON format enable saving flags in parallel Jul 18, 2022
@abidlabs
Copy link
Copy Markdown
Member

Hi @chrisemezue,

This is very cool, two quick high-level questions:

  1. Can you run the formatter on your code? That way, the CI won't complain about the formatting. The easiest way is to run this script:
bash scripts/format_backend.sh
  1. Just to confirm, is this way of saving data into a HuggingFace Dataset (having folders for each sample) compatible with the Dataset previewer on the Hub?

@osanseviero
Copy link
Copy Markdown
Contributor

osanseviero commented Jul 19, 2022

I think the title of this PR is a bit misleading. HuggingFaceDatasetSaver will still not work in parallel afaik. For me it's a bit confusing if this is saving a data.csv or a json based on the name (HuggingFaceDatasetJSONSaver). I see log_file uses csv but that does not seem to be used anywhere. If it's a csv, should we directly update HuggingFaceDatasetJSONSaver?

@chrisemezue chrisemezue changed the title enable saving flags in parallel Add a flagging callback to save json files to a hugging face dataset Jul 19, 2022
@chrisemezue
Copy link
Copy Markdown
Contributor Author

I think the title of this PR is a bit misleading. HuggingFaceDatasetSaver will still not work in parallel afaik. For me it's a bit confusing if this is saving a data.csv or a json based on the name (HuggingFaceDatasetJSONSaver). I see log_file uses csv but that does not seem to be used anywhere. If it's a csv, should we directly update HuggingFaceDatasetJSONSaver?

Thanks @osanseviero for your feedback

  • I will remove the self.log_file and other redundant variables not used.
  • the class HuggingFaceDatasetJSONSaver is just saving the flagged samples to a jsonl format instead of csv. Is there a better name you suggest?
  • I changed the name of PR to the name of its issue. If you have a better easier to understand suggestion I will love to hear.

@osanseviero
Copy link
Copy Markdown
Contributor

Hey @chrisemezue, thank you! I was a bit confused by the mentions of csvs, but now that you mention it's a json then it's great! Thanks!

@osanseviero osanseviero self-requested a review July 19, 2022 20:14
@osanseviero
Copy link
Copy Markdown
Contributor

Please let us know whenever this is ready for review :)

@chrisemezue
Copy link
Copy Markdown
Contributor Author

@osanseviero I am done now. Ready for review.

Copy link
Copy Markdown
Contributor

@osanseviero osanseviero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this! This is very cool! I left some minor comments 🤗

I would love to see an example output of using this flagging callback (a small dataset, since https://huggingface.co/datasets/chrisjay/crowd-speech-africa has too many files and it does not load :()).

cc @lhoestq

Comment thread gradio/flagging.py Outdated
Comment thread gradio/flagging.py Outdated
Comment thread gradio/flagging.py
Comment thread gradio/flagging.py
Comment thread gradio/flagging.py
Comment thread gradio/flagging.py Outdated
Comment thread gradio/flagging.py Outdated
Comment thread gradio/flagging.py Outdated
Comment thread gradio/flagging.py Outdated
@chrisemezue
Copy link
Copy Markdown
Contributor Author

@osanseviero here is an example of a small dataset with this flagging callback.

Comment thread gradio/flagging.py Outdated
Comment thread gradio/flagging.py Outdated
Comment thread gradio/flagging.py Outdated
Comment thread gradio/flagging.py Outdated
Comment thread gradio/flagging.py
@abidlabs
Copy link
Copy Markdown
Member

Hi @chrisemezue this looks really good! I left some suggestions / clarification questions in the PR, but once these are addressed, we should be good to merge

@abidlabs
Copy link
Copy Markdown
Member

Pushed some changes which should fix the tests. As discussed over Slack, we just have a couple of minor fixes, and then we should be good to merge!

@abidlabs
Copy link
Copy Markdown
Member

Thanks so much @chrisemezue for making the PR and addressing the suggestions! And thanks all for reviewing.

LGTM -- will merge in after the tests run

Copy link
Copy Markdown
Contributor

@osanseviero osanseviero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! Thanks for working on this!

Comment thread gradio/flagging.py

for component in components:
headers.append(component.label)
headers.append(component.label)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is repeated above, is that intended?

Comment thread gradio/flagging.py Outdated
Comment thread gradio/flagging.py Outdated
@abidlabs
Copy link
Copy Markdown
Member

I'll resolve the conflicts and fix the last few suggestions you made @osanseviero, so that we can get this merged in.

Thanks a bunch @chrisemezue!

@abidlabs abidlabs merged commit 9c4dc6c into gradio-app:main Aug 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add a flagging callback to save json files to a hugging face dataset

5 participants