[RFC] kernel/process: switch ProcessId to use u64 unique identifier #4777

Draft
lschuermann wants to merge 1 commit into master from dev/process-id-u64

Conversation

@lschuermann
Member

@lschuermann lschuermann commented Apr 1, 2026

Pull Request Overview

This switches the unique numeric identifier of a process instance within the ProcessId type (colloquially also called process ID) to be an 8-byte unsigned integer, instead of usize. Additionally, it changes the kernel to panic on the off chance that the counter used for assigning new process IDs rolls over.

This change is motivated by recent discussions around new IPC mechanisms. These mechanisms consist of a "discovery" phase, where applications search for, and potentially authenticate a given process, and a "communication" phase, where they actually exchange data between processes. The discovery phase emits a unique process handle that a subsequent communication phase will use to uniquely identify a previously discovered and authenticated process. Virtually the only good candidate for such a handle on a Tock system is the numeric process ID, as it refers to a particular instance of an application, and does not require us to keep track of extra state or mappings.

However, depending on which architecture we are running on, this numeric process ID will be generated based on a counter that is either 4 or 8 bytes wide (usize). While Tock systems generally are limited in the number of applications they can concurrently run, this identifier instead refers to an instance of a process. As such, this counter can be controlled and increased by processes themselves. If this counter is based on a 4-byte value it is unlikely, but not impossible, that malicious or colluding processes could force it to overflow to a known, previously assigned value within a somewhat reasonable timeframe (multiple days).
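
To put a rough number on that timeframe, here is a back-of-the-envelope sketch. The restart rate of 10,000 process starts per second is purely an assumption for illustration, not a measured Tock figure, and the function names are hypothetical.

```rust
/// Seconds needed to exhaust a counter that is `bits` wide when new
/// process instances are created at a steady `restarts_per_sec` rate.
/// The rate is an illustrative assumption, not a measured value.
fn seconds_to_wrap(bits: u32, restarts_per_sec: u64) -> u128 {
    (1u128 << bits) / restarts_per_sec as u128
}

/// Same figure expressed in whole days (rounded down).
fn days_to_wrap(bits: u32, restarts_per_sec: u64) -> u128 {
    seconds_to_wrap(bits, restarts_per_sec) / 86_400
}
```

Under that assumed rate, `days_to_wrap(32, 10_000)` comes out at 4 whole days, while `days_to_wrap(64, 10_000)` is on the order of tens of millions of years.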

We should not prevent such overflows by panicking the kernel or by preventing process restarts, as that endangers availability of the system, especially when deployed in critical and long-running scenarios.

At the same time, while unlikely, explicitly permitting counter overflows forces us to think about, model, and either handle or explicitly ignore many types of attacks (e.g., confused deputy) that could arise from such a forced counter overflow. This is not just affecting IPC, but any kernel or userspace subsystem that explicitly or implicitly relies on process ID uniqueness.

This PR instead proposes to make this numeric counter consistent across platforms and debug/release builds: it is always an 8 byte unsigned integer, and in contrast to the previous implementation (which only panics on debug builds) an overflow of this counter always panics the kernel. However, given the astronomical size of this counter, it is virtually impossible that a correct kernel implementation will ever reach this case.
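
As a minimal sketch of the proposed behavior (not the actual diff in this PR), an always-checked 64-bit counter could look roughly like the following. `IdentifierCounter` and `allocate` are hypothetical names chosen for illustration.

```rust
use std::cell::Cell;

/// Hypothetical allocator for process instance identifiers.
/// The counter is a `u64` on every platform.
struct IdentifierCounter {
    next: Cell<u64>,
}

impl IdentifierCounter {
    const fn new() -> Self {
        Self { next: Cell::new(0) }
    }

    /// Hands out a fresh identifier. `checked_add` makes the overflow
    /// check explicit, so the panic happens in debug and release builds
    /// alike once the counter can no longer advance.
    fn allocate(&self) -> u64 {
        let id = self.next.get();
        let next = id
            .checked_add(1)
            .expect("process identifier counter overflowed");
        self.next.set(next);
        id
    }
}
```

Because the check no longer depends on the build profile's overflow-check setting, debug and release kernels would behave identically at the boundary.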

This does come with some drawbacks: I don't believe performance will be impacted significantly (comparisons and increments of 64-bit numbers are decently efficient even on 32-bit systems). However, it does add a single word of memory everywhere that a Process ID is used. It also forces backwards-incompatible changes to a few userspace APIs (legacy IPC being a notable example, although we're in the process of removing that).
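
The memory cost can be illustrated with stand-in structs. This sketch assumes a handle made of a 4-byte kernel reference, a 4-byte index, and the identifier, with `u32` fields emulating the pointer-sized fields of a 32-bit target; the struct and field names are hypothetical and the real ProcessId layout may differ.

```rust
use std::mem::size_of;

/// Stand-in for a handle on a 32-bit target with a `usize` identifier.
/// `u32` fields emulate 4-byte pointers and indices.
#[allow(dead_code)]
struct Handle32 {
    kernel_ref: u32,
    index: u32,
    identifier: u32,
}

/// The same handle with the identifier widened to `u64`.
#[allow(dead_code)]
struct Handle64 {
    kernel_ref: u32,
    index: u32,
    identifier: u64,
}

/// Extra bytes per handle caused by widening the identifier.
fn extra_bytes() -> usize {
    size_of::<Handle64>() - size_of::<Handle32>()
}
```

With this particular layout the difference is one 4-byte word. A struct with a different number of 4-byte fields could pay more on targets where `u64` is 8-byte aligned, since padding may be inserted, so the overhead is worth measuring rather than assuming.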

I want to open this RFC to start a discussion about this change, whether or not it is a good idea, and what a possible pathway for integrating it into the kernel could be. This does not actively block any IPC development, but we might design our interfaces to expect that we will be passing around u64 instead of usize values.

Testing Strategy

This PR is not tested.

TODO or Help Wanted

The PR needs

  1. consensus of whether this is a good change we want to (eventually) make
  2. careful review of which existing capsule APIs this would break
  3. evaluation of performance and memory overhead impacts
  4. actual testing

Documentation Updated

  • Updated the relevant files in /docs, or no updates are required.

Formatting

  • Ran make prepush.

AI Use

  • The PR description details my use of AI in the production of the
    code in this PR, if any, and I have manually checked and
    personally certify the entire contents of this PR.

Wrote this myself.

@github-actions github-actions Bot added the kernel label Apr 1, 2026
@bradjc
Contributor

bradjc commented Apr 2, 2026

This change is motivated by recent discussions around new IPC mechanisms. These mechanisms consist of a "discovery" phase, where applications search for, and potentially authenticate a given process, and a "communication" phase, where they actually exchange data between processes. The discovery phase emits a unique process handle that a subsequent communication phase will use to uniquely identify a previously discovered and authenticated process.

It's not clear to me why an app would want to use IPC to communicate with a process and not an application. As an app author, I would want to communicate with a service, and I don't care if the service restarts.

Virtually the only good candidate for such a handle on a Tock system is the numeric process ID, as it refers to a particular instance of an application, and does not require us to keep track of extra state or mappings.

I'm surprised you didn't mention ShortId. That would be the intuitive answer in my mind. It seems like you are describing a lot of duplication to our existing AppId mechanism.

I'm wary of adding another state variable in Tock that has security implications.

@brghena
Contributor

brghena commented Apr 2, 2026

I would want to communicate with a service, and I don't care if the service restarts.

If the service restarts, everything you previously coordinated with it is gone and you've got to restart. So I'd argue that you definitely care.

Similarly if a client restarts, a response destined for it is no longer relevant.

@bradjc
Contributor

bradjc commented Apr 2, 2026

I would want to communicate with a service, and I don't care if the service restarts.

If the service restarts, everything you previously coordinated with it is gone and you've got to restart. So I'd argue that you definitely care.

Similarly if a client restarts, a response destined for it is no longer relevant.

Well, that's if the service makes it my problem to care. And maybe I get an error that I need to handle, and the only way to handle that error is to re-initialize, etc. That's OK I suppose.

Maybe I was too quick, we could have services, apps, or processes. Right now we have services, which I think makes sense. I can also see using apps. I don't quite see wanting to use processes, at least not in the Tock use case.

@bradjc
Contributor

bradjc commented Apr 2, 2026

As such, this counter can be controlled and increased by processes themselves. If this counter is based on a 4-byte value it is unlikely, but not impossible, that malicious or colluding processes could force it to overflow to a known, previously assigned value within a somewhat reasonable timeframe (multiple days).

We do have a policy about what happens on an app crash, and a system worried about this should just not keep restarting any app that crashes too many times.

@bradjc
Contributor

bradjc commented Apr 2, 2026

However, depending on which architecture we are running on, this numeric process ID will be generated based on a counter that is either 4 or 8 bytes wide (usize).

I do think it might make sense to standardize this, and I would choose a u32.

@lschuermann
Member Author

It's not clear to me why an app would want to use IPC to communicate with a process and not an application. As an app author, I would want to communicate with a service, and I don't care if the service restarts.

Echoing @brghena's point here: we're interested in identifying a particular process instance to keep state related to this instance. While initially motivated by IPC, we also use ProcessId for these purposes in the kernel (e.g., as part of grants).

I'm surprised you didn't mention ShortId. That would be the intuitive answer in my mind. It seems like you are describing a lot of duplication to our existing AppId mechanism.

ShortId and AppId have their purposes, but those relate to applications: they are stable identifiers of an application over time. These can be used to authenticate access to persistent resources, or to verify that a given process instance is indeed a process instance of an authenticated application.

However, for many use-cases (including IPC), ShortId and AppId have two shortcomings:

  1. They describe an application, not a process. There could potentially be multiple process instances belonging to the same application, either concurrently or over time.
  2. Partially due to the above, ShortId and AppId cannot be used as efficient "handles" to access state related to a given process identifier.

ProcessId instead is a handle to a given process instance, and as such is somewhat orthogonal to application IDs.

Without going too much into the details of how new IPC systems should work in Tock, they will likely use both AppId and ProcessId. Specifically, they will want to have one stable, persistent, and authenticated identifier uniquely associated with a given application, and one identifier that can be used for actual transactions, referring to a particular process instance, in a way that is efficient to look up.

I'm wary of adding another state variable in Tock that has security implications.

This is exactly what I am trying to bring to our attention here. I'm worried that we are currently using ProcessId assuming uniqueness properties that don't actually exist. Such a mismatch commonly leads to serious security vulnerabilities known as "confused deputy attacks". The current (and likely future) IPC mechanisms are a great example of this, and I'm almost certain we will find others in the kernel. This PR is trying to rectify this mismatch between assumed and actual guarantees.

Notably, other larger kernels (like Linux) don't have this issue and can recycle PIDs, because they keep extra, unique state around for tracking shared IPC resources between processes. Naïvely, this is in conflict with Tock's heapless kernel architecture.

We do have a policy about what happens on an app crash, and a system worried about this should just not keep restarting any app that crashes too many times.

I'm not entirely opposed, but we should think about the implications this has on Tock's threat model (specifically concerning availability) if we limit the total number of application (re)starts to u32::MAX, when assuming that we guarantee ProcessId uniqueness.

Instead, I'm more in favor of evaluating the actual overheads of using u64s on 32-bit systems.

@jrvanwhy
Contributor

jrvanwhy commented Apr 7, 2026

Is it difficult to evaluate the overhead of this change? If some of the boards already build with this PR, then we should be able to check the binary sizes to see how much they've bloated, correct? If the code size bloat isn't significant then I would not expect the runtime increase to be significant.

Unless there's a lot of code size overhead, I think 64-bit is the way to go.

@brghena
Contributor

brghena commented Apr 8, 2026

Discussed on call today that this does actually have a larger code size impact than expected: ~500 bytes. Leon is going to work on this to see if that can trivially be reduced.

@bradjc
Contributor

bradjc commented Apr 10, 2026

  1. They describe an application, not a process. There could potentially be multiple process instances belonging to the same application, either concurrently

This is not the case.

The Tock kernel ensures that each running process has a unique
application identifier; if two userspace binaries have the same AppID,

2. Partially due to the above, ShortId and AppId cannot be used as efficient "handles" to access state related to a given process identifier.

I disagree. ShortId can be used for that, for example that is precisely how we determine if a process is allowed to access a record in a persistent store.

ProcessId instead is a handle to a given process instance, and as such is somewhat orthogonal to application IDs.

Today, I agree, however, with:

a "discovery" phase, where applications search for, and potentially authenticate a given process...The discovery phase emits a unique process handle that a subsequent communication phase will use to uniquely identify a previously discovered and authenticated process

The separation is much less clear.

I'm wary of adding another state variable in Tock that has security implications.

This is exactly what I am trying to bring to our attention here. I'm worried that we are currently using ProcessId assuming uniqueness properties that don't actually exist. Such a mismatch commonly leads to serious security vulnerabilities known as "confused deputy attacks". The current (and likely future) IPC mechanisms are a great example of this, and I'm almost certain we will find others in the kernel. This PR is trying to rectify this mismatch between assumed and actual guarantees.

I don't believe this is quite true. We do guarantee ProcessId will be unique with the panic-on-overflow approach. We do not use ProcessId for policies. We do not use ProcessId for authorization. We do use it for IPC today, but, we know our IPC is broken in many ways, and since it is the subsystem motivating this change, I'm not sure it is a good example of best practice or suggests we should continue using ProcessId in this way.

Instead, I'm more in favor of evaluating the actual overheads of using u64s on 32-bit systems.

I'm not sure I understand the need to allow a process (or processes) to intentionally restart itself (themselves) billions of times. However, if it is important that we just eliminate this concern and we manage the overhead in a way we find suitable, I'm not that opposed to making the counter a u64.

What I am pushing back on is then using that change as an indicator we should use ProcessId as anything security related (broadly construed), adding complexity to the OS both in its implementation, but also in deciding when to use AppId vs. ProcessId, what they actually guarantee (e.g., AppId is unique among running processes), and it being (potentially) easy to use one/both in a way they cannot safely be used.

@lschuermann
Member Author

lschuermann commented Apr 10, 2026

While AppID requires that no two running processes share the same application ID, it does not uniquely identify a given process. There can be multiple, different process instances spawned by the same application over time, all carrying the same application ID.

As such, while AppID is an appropriate tool to govern access to persistent state shared by all process instances of an application over time, it is not appropriate to address a particular process instance and its inherent, implicit ephemeral state.

However, reliably addressing a particular process instance is important for many mechanisms both in the kernel and in userspace. For example, sensitive kernel APIs to interact with processes (such as scheduling upcalls or terminating them) relate not to applications, but to given process instances, and hence take ProcessIds instead of application IDs.

For example, when a process starts a security-sensitive operation, and a capsule later wants to inform that particular process instance of the result of that operation, it will use its ProcessId to schedule a callback. If it is possible that this ProcessId may now refer to another process instance, perhaps even of another application altogether, then we would leak sensitive data back to another process. Similar issues exist around process management, etc.
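
The scenario above can be sketched with a toy process table. This is only a model under stated assumptions: `Handle`, `Slot`, `ProcessTable`, `start`, and `resolve` are hypothetical names, and the table is a fixed four-slot array rather than anything from the Tock kernel.

```rust
/// Handle a capsule would hold on to: slot index plus instance identifier.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Handle {
    index: usize,
    identifier: u64,
}

/// One running process instance in the toy table.
#[derive(Clone, Copy)]
struct Slot {
    identifier: u64,
    app: &'static str,
}

struct ProcessTable {
    slots: [Option<Slot>; 4],
    next_identifier: u64,
}

impl ProcessTable {
    fn new() -> Self {
        Self { slots: [None; 4], next_identifier: 0 }
    }

    /// Start (or restart) a process in `index` with a fresh identifier.
    fn start(&mut self, index: usize, app: &'static str) -> Handle {
        let identifier = self.next_identifier;
        self.next_identifier = identifier
            .checked_add(1)
            .expect("identifier counter overflowed");
        self.slots[index] = Some(Slot { identifier, app });
        Handle { index, identifier }
    }

    /// Resolve a handle only if the slot still holds the same instance.
    fn resolve(&self, handle: Handle) -> Option<&'static str> {
        match self.slots.get(handle.index)? {
            Some(slot) if slot.identifier == handle.identifier => Some(slot.app),
            _ => None,
        }
    }
}
```

As long as identifiers are never reused, a handle taken before a restart stops resolving afterwards. If the counter were allowed to wrap, an old handle could eventually match a different instance in the same slot, which is the leak described above.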

This makes ProcessIds security sensitive; we need to carefully reason about which guarantees they give us and about how we use them in practice. I agree that we need to carefully consider when to use AppID or ProcessId, and the relationship between them, as they have fundamentally different use cases, semantics, and guarantees.

As an aside: you might be wondering why other OSes don't run into similar issues. E.g., on Linux, PIDs can be recycled, so why isn't that problematic? The answer is twofold: on the one hand, this is a real problem in practice. PID reuse is a well-known security-relevant issue that causes real vulnerabilities. On the other hand, Linux uses properly unique identifiers within the kernel itself to refer to process-related resources (pointers to kernel objects). In fact, the solution to the aforementioned PID reuse issues is to extend this notion of unique kernel-backed identifiers into userspace in the form of pidfds.

Also, the following is incorrect:

We do guarantee ProcessId will be unique with the panic-on-overflow approach.

Currently, we only (accidentally?) perform a panic-on-overflow in debug builds, while in release builds we silently wrap around.
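
The difference comes from Rust's default overflow-check settings: a plain `+ 1` panics when overflow checks are enabled (the debug default) and wraps when they are not (the release default). A small sketch with explicit operations, using `u32` as a stand-in for a 32-bit `usize` and hypothetical function names:

```rust
/// Mirrors what `current + 1` does when overflow checks are disabled,
/// which is the default for release builds.
fn next_wrapping(current: u32) -> u32 {
    current.wrapping_add(1)
}

/// Explicit check that behaves the same in every build profile.
fn next_checked(current: u32) -> Option<u32> {
    current.checked_add(1)
}
```

Here `next_wrapping(u32::MAX)` silently returns 0, an identifier that was handed out before, while `next_checked(u32::MAX)` returns `None` and leaves the policy decision to the caller.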

If we were to commit to a panic-on-overflow approach that guarantees Process ID uniqueness over a kernel instance's lifetime, I want us to be aware of, and document the implications that this has on availability of the Tock kernel, in particular in the face of adversarial processes.

@brghena
Contributor

brghena commented Apr 10, 2026

I think the following could occur if IPC were to use AppID as a handle:

  1. A service starts with ShortID 1
  2. A process discovers it and starts communicating with it
  3. The service dies
  4. The service restarts and has ShortID 1 again
  5. The process doesn't know that the service has restarted
  6. The service gets a bunch of information that doesn't make sense to it
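
A toy model of those six steps, assuming a client that remembers what it saw at discovery time. All types and names here are hypothetical, and ShortId is simplified to a plain `u32`.

```rust
/// Toy model of a running service.
#[derive(Clone, Copy)]
struct Service {
    short_id: u32,     // stable across restarts
    instance: u64,     // changes on every restart
    has_session: bool, // ephemeral state negotiated with a client
}

impl Service {
    fn start(short_id: u32, instance: u64) -> Self {
        Self { short_id, instance, has_session: false }
    }

    /// A restart keeps the ShortId but loses all ephemeral state.
    fn restart(self) -> Self {
        Service::start(self.short_id, self.instance + 1)
    }
}

/// What a client remembers after discovery.
struct ClientView {
    short_id: u32,
    instance: u64,
}

impl ClientView {
    fn matches_by_short_id(&self, s: &Service) -> bool {
        self.short_id == s.short_id
    }

    fn matches_by_instance(&self, s: &Service) -> bool {
        self.short_id == s.short_id && self.instance == s.instance
    }
}
```

After a restart, `matches_by_short_id` still returns true even though the session is gone, whereas `matches_by_instance` returns false and gives the client a chance to notice.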

I think there's also a worse case:

  1. Two services share the same AppID: A and B.
  2. Tock starts service A with ShortID 1
  3. A process discovers the service and starts communicating with it
  4. Service A dies
  5. Tock starts Service B with ShortID 1 again
  6. The process doesn't know that the service has been replaced
  7. Service B gets data that was meant for Service A

Now, that's less bad than it sounds, as for some reason both Service A and B shared an AppID, so presumably we would equally trust them.

@brghena
Contributor

brghena commented Apr 10, 2026

To be clear, that's not a statement that ProcessIDs are good. If they're problematic enough, we shouldn't use them for IPC. But I think AppID or ShortID doesn't meet our handle needs either, unless I'm mistaken. That's mostly why I'm listing scenarios: so you can correct me if I'm wrong.

@bradjc
Contributor

bradjc commented Apr 10, 2026

I think the following could occur if IPC were to use AppID as a handle:

1. A service starts with ShortID 1
2. A process discovers it and starts communicating with it
3. The service dies
4. The service restarts and has ShortID 1 again
5. The process doesn't know that the service has restarted
6. The service gets a bunch of information that doesn't make sense to it

Certainly something has to handle process crashes. But rather than discuss ProcessIds vs. ShortIds vs. SomethingElse, I'd be more interested in deciding what conceptual model we are using, and then mapping that to an implementation. I would say currently IPC uses a named service model, and it seems to me like the service itself should be responsible for maintaining continuity of functionality. Or, the restarted service could respond with a "service failed, please restart" error. Alternatively, the kernel knows if a process restarts, and could maybe somehow inject that error on to the using process.

Or, we decide the failure is entirely the problem of the using process.

I think there's also a worse case:

  1. Two services share the same AppID: A and B.

Those are then the same service, without some other identifier for services. This is consistent with IPC today.

@bradjc
Contributor

bradjc commented Apr 10, 2026

Also, the following is incorrect:

We do guarantee ProcessId will be unique with the panic-on-overflow approach.

Currently, we only (accidentally?) perform a panic-on-overflow in debug builds, while in release builds we silently wrap around.

Ah, yes, thank you for correcting that.

However, reliably addressing a particular process instance is important for many mechanisms both in the kernel and in userspace. For example, sensitive kernel APIs to interact with processes (such as scheduling upcalls or terminating them) relate not to applications, but to given process instances, and hence take ProcessIds instead of application IDs.

We can make this even better by implementing ProcessIds as the tuple of (index, ShortId, identifier). identifier could even be 8 bits, and sure, you could send old data to a new instance of the same app restarted 256 times in the future, but, eh. In practice this is more complicated, as ShortId::LocallyUnique would have to be handled somehow.
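
Roughly, the tuple idea could be sketched as follows. This is an illustration only: `CompositeId` and `restarted` are hypothetical names, ShortId is modeled as a plain `u32`, and the `ShortId::LocallyUnique` case is ignored entirely.

```rust
/// Hypothetical composite identifier: (index, ShortId, identifier).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct CompositeId {
    index: u8,
    short_id: u32,  // simplified stand-in for ShortId
    identifier: u8, // small per-restart counter, allowed to wrap
}

/// Identifier of the next instance of the same app in the same slot.
fn restarted(id: CompositeId) -> CompositeId {
    CompositeId {
        identifier: id.identifier.wrapping_add(1),
        ..id
    }
}
```

With an 8-bit identifier, an old value only collides with a later instance of the same app in the same slot after exactly 256 restarts, which is the trade-off described above.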

I do agree we need ProcessId to be separate from ShortId at the very least to distinguish the grant region from one running process to another (of the same app). However, I would advocate for removing the .id() method from ProcessId, removing the (public) notion that there is any sort of number associated with an ephemeral process. Very physical Apple credit card-esque. This change is actually quite easy, with only a couple debugging capsules (and BLE advertisements, interestingly) that need to change.
