Skip to content

parallel json parsing#33

Merged
subwaystation merged 7 commits intograph-genome:masterfrom
dimatr:master
Apr 10, 2020
Merged

parallel json parsing#33
subwaystation merged 7 commits intograph-genome:masterfrom
dimatr:master

Conversation

@dimatr
Copy link
Copy Markdown
Collaborator

@dimatr dimatr commented Apr 8, 2020

On a 2.2 GB test this drops the JSONparser time from 3 min to 30 sec

@josiahseaman josiahseaman self-requested a review April 8, 2020 21:29
Copy link
Copy Markdown
Member

@josiahseaman josiahseaman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python -m pip install ray
Collecting ray
ERROR: Could not find a version that satisfies the requirement ray (from versions: none)
ERROR: No matching distribution found for ray

There's clearly a repo here: https://pypi.org/project/ray/ I've updated pip. Perhaps this is pypy only distro? Could you please give me more info on ray install compatibility and whether we'll be able to build it into the requirements.txt or some other automated solution?

@josiahseaman
Copy link
Copy Markdown
Member

The issue appears to be that wheels are only built for Linux and MacOS, but I'm testing on a Windows machine. The lack of support makes sense really, but it should at least be noted this would be the first departure of Pantograph from being cross-platform compatible. Would you mind writing an OS switch and local import instead?

https://ray.readthedocs.io/en/latest/installation.html

@lomereiter
Copy link
Copy Markdown
Contributor

Also consider using an OS-independent library (joblib?)

@subwaystation
Copy link
Copy Markdown
Member

Do we really need support for Windows?
There is no Windows support for e.g. odgi, too.
We will have our docker pipeline, so as long as docker is available, the whole thing will work out.

@dimatr
Copy link
Copy Markdown
Collaborator Author

dimatr commented Apr 9, 2020

Please check now. The current change works on Linux and Windows

@subwaystation
Copy link
Copy Markdown
Member

Reading in the data on a 28 core machine is much faster now! Thanks @dimatr .

Just one minor thing: How about a parameter where users can specify the number of cores to use?

@dimatr
Copy link
Copy Markdown
Collaborator Author

dimatr commented Apr 9, 2020

I will add one new parameter --parallel-cores with default os.cpu_count()

@dimatr
Copy link
Copy Markdown
Collaborator Author

dimatr commented Apr 9, 2020

should be ready now, please check

@subwaystation subwaystation merged commit 8bc49fb into graph-genome:master Apr 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants