
Knowing the source site in the aggregation API / aggregate queries need key discovery mechanism #583

@alois-bissuel

Description


Hello,

We have two use cases in advertising which are hard to fit into the current version of the aggregate API. Advertisers and marketers want to know from which domain conversions were made. For fraud prevention, knowing on which domain clicks were made is paramount for detecting and banning shady websites set up to siphon money off advertisers.

The source site (i.e. the publisher domain) was removed from aggregatable reports in #445. Before that pull request (which aims to solve #439), the source site was available in the clear (as the attribution_destination currently is).

Encoding the publisher domain in the aggregate API in its current state (i.e. with no source_site in the aggregatable reports) is a very hard problem because of the following characteristics:

  • it has a high cardinality (in the hundreds of thousands or more, depending on the aggregation window).
  • it is dynamic (any publisher can easily monetize a new website on the open web by plugging it into an SSP).
  • it is hard to form a good a priori estimate of which publishers might lead to conversions (a campaign for high-end headphones might get most of its conversions from a small audiophile blog).

So far, I see three potential solutions, the first two of which use plausible deniability to add the source_site back in the clear to the aggregatable reports:

  1. Include the source_site back in aggregatable reports, and, with some probability, send empty conversion reports (e.g. random key and zero value) from any website the user has visited. This might enable the exfiltration of even more user data than before (a very targeted campaign would allow a bad actor to gain knowledge of the browsing habits of the targeted user group). Hence the second proposal.
  2. Same as 1, but using a domain picked from a list of the most-visited publishers in the country or region. This list can be generated in a privacy-safe and decentralized manner using a mechanism such as RAPPOR.
  3. Add a mechanism for key retrieval or discovery to the aggregation service. The issue there is that encoding the domain takes a large number of bits: from 20 bits to encode a million distinct domains using a dictionary (the most space-efficient encoding) up to 5 bits per character (using the lowercase Latin alphabet), at which point the full 128-bit key space might not be large enough to encode many long domains. This type of mechanism might be useful for any dimension with a large cardinality.
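For context, the bit counts in option 3 follow from a short back-of-the-envelope calculation (a sketch only; the exact figures depend on the dictionary size and the alphabet chosen):

```python
import math

# Dictionary encoding: assign each known domain an integer index.
# A million distinct domains need ceil(log2(1_000_000)) bits.
dictionary_bits = math.ceil(math.log2(1_000_000))

# Character encoding: the lowercase Latin alphabet needs
# ceil(log2(26)) = 5 bits per character, so a 128-bit key can hold
# at most floor(128 / 5) characters of domain name.
bits_per_char = math.ceil(math.log2(26))
max_chars_in_key = 128 // bits_per_char

print(dictionary_bits, bits_per_char, max_chars_in_key)  # 20 5 25
```

A 25-character budget is easily exceeded by long domain names, which is why the character encoding does not fit the 128-bit key space while also leaving room for other dimensions.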

What are your suggestions?

N.B. This issue also concerns the Private Aggregation API, as it uses the same report format as ARA for slightly different use cases. I am cross-posting a very similar issue there.
