fix - Metric logging work with new monarch API#451
fix - Metric logging work with new monarch API#451felipemello1 merged 10 commits intometa-pytorch:mainfrom
Conversation
…-metric-logging-reland
404ee1d to
00ccf0c
Compare
|
can ignore the unit test failing. CI is using old monarch version. Error is harmless too. Tests pass locally. |
| ).as_service() | ||
| generator = await GeneratorActor.options( | ||
| **service_config, mesh_name="GeneratorActor" | ||
| ).as_service() |
There was a problem hiding this comment.
What's the purpose of mesh_name here? Is it for the better display on wandb? Why is it not applied to all main files?
There was a problem hiding this comment.
| ) | ||
| await global_logger.register_fetcher.call_one(local_fetcher_actor, proc) | ||
| # Generate a unique ID to map procmesh to fetcher | ||
| proc._uid = str(uuid.uuid4()) |
There was a problem hiding this comment.
do you mind approving it?
There was a problem hiding this comment.
I am a little concerning about the broken CI. Wouldn't it cause all the subsequent commits to break as well?
There was a problem hiding this comment.
I dont think that errors are related to this PR. But let me confirm by opening a dummy PR
| # Deregister local logger from global logger | ||
| if hasattr(proc_mesh, "_local_fetcher"): | ||
| # Deregister LocalFetcherActor from GlobalLoggingActor | ||
| if hasattr(proc_mesh, "_local_fetcher") and hasattr(proc_mesh, "_uid"): |
There was a problem hiding this comment.
Is it possible for a proc_mesh that has _local_fetcher but not _uid?
There was a problem hiding this comment.
no, they should always have both. I guess i was having extra safe here. Is it confusing?
There was a problem hiding this comment.
I understand you write it like this to be safe. But I just worry it may hide some potential errors. How about raise an error if it has _local_fetcher but not _uid?
Co-authored-by: Felipe Mello <[email protected]>
Co-authored-by: Felipe Mello <[email protected]>
We landed #351, which adds better rank names to logging (later reverted in #429). This adds 351 back, but with fixes for monarch.
Goal of the original PR:

Monarch changed APIs, breaking the code in two ways:
1)
Situation: To get the
actor_nameandProcMeshuid, i was usingmonarch.actor.context().Problem: Now actor_name returns
{actor_name}_{ProcMeshuid}(orclientif outside a PrcoMesh), and calling context.world_name errors if called outside of aProcMeshcontext.Fix: Never call
.world_nameand get the uid from actor_name, if any.Situation: To allow
GlobalLoggingActorto access theLocalFetcherActoron eachProcMesh, we would register the (key,value) as{ProcMesh: LocalFetcherActor}Problem: The ProcMesh is not hashable anymore.
Solution: Instead of using ProcMesh as a key, we create a 'str' UUID for the proc.
Extra: updated docstrings to make clear whats going on