
Knowing the source site in the aggregation API / aggregate queries need key discovery mechanism #583

@alois-bissuel

Description


Hello,

We have two use cases in advertising which are hard to fit into the current version of the aggregate API. Advertisers and marketers want to know from which domain conversions were made. For fraud prevention, knowing on which domain clicks were made is paramount for detecting and banning shady websites set up to siphon money off advertisers.

The source site (i.e. the publisher domain) was removed from aggregatable reports in #445. Before that pull request (which aims to solve #439), the source site was available in the clear (as the attribution_destination currently is).

Encoding the publisher domain in the aggregate API in its current state (i.e. with no source_site in the aggregatable reports) is a very hard problem because of the following characteristics:

  • it has a high cardinality (in the hundreds of thousands or more, depending on the aggregation window).
  • it is dynamic (any publisher can easily monetize a new website on the open web by plugging it into an SSP).
  • it is hard to form a good a priori estimate of which publishers might lead to conversions (a campaign for high-end headphones might get most of its conversions from a small audiophile blog).

So far, I see three potential solutions, the first two of which use plausible deniability to add the source_site back in the clear to the aggregatable reports:

  1. Include the source_site back in aggregatable reports, and, with some probability, send empty conversion reports (e.g. random key and zero value) from any website the user has visited. This might enable the exfiltration of even more user data than before (a very targeted campaign would allow a bad actor to gain knowledge of the browsing habits of the targeted user group). Hence the second proposal.
  2. Same as 1, but using a domain picked from a list of the most-visited publishers in the country or region. This list can be generated in a privacy-safe and decentralized manner using a mechanism such as RAPPOR.
  3. Add a mechanism for key retrieval or discovery to the aggregation service. The issue there is that encoding the domain takes a large number of bits: from 20 bits to encode a million distinct domains using a dictionary (the most space-efficient encoding) up to 5 bits per character (using the lowercase Latin alphabet), at which point the full 128-bit key space might not be large enough to encode many long domains. This type of mechanism might be useful for any dimension with a large cardinality.
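For context, the bit counts in option 3 follow from a short back-of-the-envelope calculation (a sketch only; the exact figures depend on the dictionary size and the alphabet chosen):

```python
import math

# Dictionary encoding: assign each known domain an integer index.
# A million distinct domains need ceil(log2(1_000_000)) bits.
dictionary_bits = math.ceil(math.log2(1_000_000))

# Character encoding: the lowercase Latin alphabet needs
# ceil(log2(26)) = 5 bits per character, so a 128-bit key can hold
# at most floor(128 / 5) characters of domain name.
bits_per_char = math.ceil(math.log2(26))
max_chars_in_key = 128 // bits_per_char

print(dictionary_bits, bits_per_char, max_chars_in_key)  # 20 5 25
```

A 25-character budget is easily exceeded by long domain names, which is why the character encoding does not fit the 128-bit key space while also leaving room for other dimensions.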

What are your suggestions?

N.B. This issue also concerns the Private Aggregation API, as it uses the same report format as ARA for slightly different use cases. I am cross-posting a very similar issue there.
