I spent a little time last week digging into Google's FLoC whitepaper to better understand the proposal and how it would impact Narrative.  I'll probably write a more thorough post on the corporate blog with some of the high-level takeaways, but I learned a few technical things that I thought were worth writing about here.

Note:  While I have a degree in computer science, I would not consider myself a computer scientist, so take any of the technical bits here with a grain of salt.  I might be totally full of shit.

FLoC stands for Federated Learning of Cohorts and is an API being proposed by Google that would allow companies to use data to target users without exposing anything identifying to the company.  As the whitepaper lays out, the initiative is really about balancing the utility of data with the privacy of the users the data has been collected on.  It is part of a larger Google initiative in Chrome called the Privacy Sandbox.

In short, FLoC proposes a way to allow data about users to be collected and used while preserving k-anonymity.  k-anonymity was defined in 1998:

"Given person-specific field-structured data, produce a release of the data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful."

Data is said to be k-anonymous if the data on any one user is indistinguishable from the data on at least k-1 other users.

FLoC uses a number of techniques to achieve k-anonymity while claiming to preserve the utility of the underlying data.

Chrome as the Write-only Data Store

Traditionally, data collected about a user has been stored in cookies or server-side by the company collecting it.  The data was keyed off of a pseudo-anonymous identifier tied to the user and/or their browser.

The Privacy Sandbox changes this by allowing the data to be written to Chrome, but without allowing the data to be read back out in its original form.  The browser effectively becomes the unique identifier, but neither the identifier nor the data collected is exposed back to the company that is collecting it (or their partners).

Cohort Detection

With the data stored locally, there needs to be a mechanism that provides k-anonymity while retaining as much utility as possible (usually measured in advertising effectiveness).  The first step is trying to determine which users are similar to each other.  This "cohort detection" has typically been done server-side, and logically one would think a central service to compare users would be a prerequisite, but using a technique called SimHash, this can be done locally with no shared user data repository.

The magic of SimHash is that it creates a cohort ID based on the local data alone, while ensuring that anyone with the same cohort ID is similar based on the data that has been collected about them.  Further, cohort IDs that are near each other are likely to be more similar to each other than cohort IDs that are further apart.  The idea is that instead of exposing the underlying data, Chrome will allow companies to see which cohort a user is in.
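To make that a bit more concrete, here is a toy illustration of the general SimHash idea in Scala.  This is not the actual FLoC implementation (the whitepaper works over feature vectors of the sites a user has visited), and the domain names are made up; the point is just that each feature's hash "votes" on each output bit, so similar inputs tend to land on the same or a nearby cohort ID.

// A toy SimHash, assuming "features" are just strings such as visited domains.
// Each feature's hash votes on every bit of the output; a bit of the cohort ID
// is 1 when the positive votes win. Similar feature sets produce nearby IDs.
object SimHashSketch {
  def simHash(features: Seq[String], bits: Int = 3): String = {
    val votes = Array.fill(bits)(0)
    for (f <- features; i <- 0 until bits) {
      if (((f.hashCode >>> i) & 1) == 1) votes(i) += 1 else votes(i) -= 1
    }
    votes.map(v => if (v > 0) '1' else '0').mkString
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical browsing histories; overlapping histories tend to land
    // in the same or a nearby cohort.
    println(simHash(Seq("news.example", "sports.example", "weather.example")))
    println(simHash(Seq("news.example", "sports.example", "finance.example")))
  }
}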

The cohort ID itself is a binary string, and it can be any length; the longer the string, the more precise the cohorts.  For the following examples we'll assume a string of three bits (8 possible cohorts).  In practice, to preserve the utility of the data, the string would need to be longer.

Let's assume we have six users with the following cohort IDs:

001
001
011
111
101
100

The first two users are in the same cohort, and the third user is near the first two because only one bit needs to be flipped to go from 001 -> 011.  That single bit flip property also applies for 011 -> 111, 111 -> 101, and 101 -> 100.  Assuming SimHash has done its job, we now have users placed in cohorts that let us understand which ones are more alike.
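"Near" here just means Hamming distance: the number of bits you have to flip to turn one cohort ID into the other.  A small Scala helper, run against the example IDs above, makes the four total flips in that ordering concrete:

object BitFlips {
  // Hamming distance between two equal-length cohort IDs.
  def flips(a: String, b: String): Int =
    a.zip(b).count { case (x, y) => x != y }

  // Total bit flips walking down an ordered list of cohort IDs.
  def totalFlips(ids: Seq[String]): Int =
    ids.sliding(2).collect { case Seq(a, b) => flips(a, b) }.sum

  def main(args: Array[String]): Unit = {
    val original = Seq("001", "001", "011", "111", "101", "100")
    println(totalFlips(original)) // 4
  }
}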

One might think we are done, but we still need to look at k-anonymity.  For our purposes, let's say we want the data to have a k-anonymity of 3.  If we were to expose the cohort IDs as is, we would violate that principle, as no cohort has more than two users in it.
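A quick way to see the violation on our six example users: group them by their raw cohort ID and check the group sizes against k.

object KAnonymityCheck {
  def main(args: Array[String]): Unit = {
    val ids = Seq("001", "001", "011", "111", "101", "100")
    val k   = 3
    // Group users by their raw cohort ID and count the members of each group.
    val cohortSizes = ids.groupBy(identity).map { case (id, users) => id -> users.size }
    println(cohortSizes)                       // 001 -> 2, every other ID -> 1
    println(cohortSizes.values.forall(_ >= k)) // false: the raw IDs are not 3-anonymous
  }
}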

SortingLSH

This is where the property that cohort IDs closer to each other are more alike comes in handy.  Instead of using the cohort ID that SimHash produces, what if we just sorted all of the users by their cohort ID and then split the resulting set into groups large enough to guarantee k-anonymity?  In our example above that would mean we end up with two cohorts: one consisting of the first three users and the other consisting of the last three.
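Here is a minimal sketch of that chunking step, assuming the IDs are already in the desired order and simply splitting them into fixed groups of k (the actual proposal allows variable-sized groups of at least k, which avoids the undersized last group this naive version could produce):

object SortingLshSketch {
  // Given cohort IDs already sorted into the desired order, merge consecutive
  // users into groups of k and expose the group index instead of the raw hash.
  def assignCohorts(sortedIds: Seq[String], k: Int): Seq[Int] =
    sortedIds.zipWithIndex.map { case (_, i) => i / k }

  def main(args: Array[String]): Unit = {
    val sorted = Seq("001", "001", "011", "111", "101", "100")
    println(assignCohorts(sorted, k = 3)) // List(0, 0, 0, 1, 1, 1)
  }
}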

To do this we are presented with a few challenges.  The first challenge is that it now seems we do need a central service to do the sorting.  It isn't entirely clear from the FLoC proposal how this is going to be tackled, so I'm going to ignore this for now.

The second problem is how to sort the users.  The paper recommends sorting the cohort IDs lexicographically.  Honestly, after looking up lexicographic order on Wikipedia I was no closer to understanding what was meant by it.  With an understanding of how SimHash works and what it is trying to achieve, it was obvious that the order needed to minimize the total number of bit flips between adjacent users across the entire set of users.  A well-sorted list of users would never put two users next to each other if there was a better choice.

Obvious Sorting Techniques

I was hoping that there was some magical property to binary strings where I could just sort them "alphabetically" and end up with the desired results.

001
001
011
100
101
111

The total number of bit flips when sorting alphabetically is six, which for six users seems like a pretty good result.  However, compared to our original list, which had a total of only four bit flips, it is clear that sorting alphabetically isn't the best strategy.

My next try was to remove the zero-padding from the binary strings and do an alphabetical sort.

1
1
100
101
11
111

Again, a total bit flip count of six.  Obviously not our solution.  The next try was to sort the decimal representation of the strings.

1
1
3
4
5
7

Still at six.  Still not our solution.

Nick's Algorithm, Brute Force, and Gray Codes

At this point I was pretty annoyed, because I had gone through all of the obvious ways to sort the data and I still hadn't ended up with an answer.  My next attempt was to try to come up with an algorithm that would choose the ideal sort.  My attempt was pretty simple.  It would start with the lowest-value cohort ID and then look at all of the remaining cohort IDs and select the one that was the closest.  Then repeat that process for the remaining IDs, looking for the closest to the one most recently selected.
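Roughly, the attempt looked like this (a sketch of my greedy approach, not anything from the whitepaper):

object GreedySort {
  def flips(a: String, b: String): Int =
    a.zip(b).count { case (x, y) => x != y }

  // Greedy ordering: start from the lowest cohort ID, then repeatedly pull in
  // whichever remaining ID is closest (fewest bit flips) to the last one chosen.
  def greedyOrder(ids: Seq[String]): Seq[String] = {
    val remaining = ids.sorted.toBuffer
    val ordered   = scala.collection.mutable.Buffer(remaining.remove(0))
    while (remaining.nonEmpty) {
      val next = remaining.minBy(flips(ordered.last, _))
      remaining -= next
      ordered += next
    }
    ordered.toSeq
  }

  def main(args: Array[String]): Unit = {
    println(greedyOrder(Seq("001", "001", "011", "111", "101", "100")))
    // List(001, 001, 011, 111, 101, 100): four total bit flips
  }
}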

If I run my algorithm on our list of six users, I actually match the original list with only four bit flips.  But would this hold for larger sets of users and longer binary strings?  To test it out I created a test data set, shuffled it randomly, counted the bit flips, and reran that test a million times.  If my algorithm were optimal, I should never be able to find a random ordering with fewer total bit flips than the one my algorithm produced.
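The test harness was along these lines.  The data set is made up and the lexicographic candidate is a stand-in for whatever ordering is being evaluated; the idea is just to see whether any random shuffle does better.

import scala.util.Random

object ShuffleTest {
  def totalFlips(ids: Seq[String]): Int =
    ids.sliding(2).collect { case Seq(a, b) => a.zip(b).count { case (x, y) => x != y } }.sum

  def main(args: Array[String]): Unit = {
    // Made-up test data: 50 users with random 8-bit cohort IDs.
    val users = Seq.fill(50)(Seq.fill(8)(Random.nextInt(2)).mkString)

    // Plug in whatever ordering the candidate algorithm produced; for brevity
    // this sketch just uses a plain lexicographic sort as the candidate.
    val candidateFlips = totalFlips(users.sorted)

    // Shuffle repeatedly (the real test used a million runs); if any random
    // ordering has fewer total bit flips, the candidate was not optimal.
    val beaten = (1 to 100000).exists(_ => totalFlips(Random.shuffle(users)) < candidateFlips)
    println(if (beaten) "found a better ordering" else "no better ordering found")
  }
}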

Unfortunately that was not the case, and the reason became obvious.  My attempt was finding the next best user to put in the list without caring about the sequence of users after that one.  Basically, I was optimizing for a local optimum rather than a global one.

At this point I was totally out of ideas, but I figured this must be a solved problem.  In an ironic twist of fate, I needed to Google how to implement an optimal sort for an algorithm proposed in a Google whitepaper that is itself based on SimHash, an algorithm Google uses to help with its search indexing.

I think of myself as a pretty savvy person when it comes to searching for things online, but this one was difficult.  I found some StackOverflow posts that I think were related to what I was trying to do, which claimed that the problem was a restatement of the Traveling Salesman Problem.  I found at least one academic paper that proposed an algorithm running 23 lines of pseudocode.  And then I hit pay dirt.

Gray codes, aka reflected binary codes, are a binary numbering system in which each successive code varies by only a single bit.  Basically, exactly what I wanted.

The one remaining problem was how to calculate a Gray code from our cohort IDs.  Fortunately, Rosetta Code offered me a number of ways to do this in almost any language of my choosing.  Here is the Scala excerpt for encoding an integer n as its Gray code:

(n ^ (n >>> 1)).toBinaryString  // Gray code of n, formatted as a binary string

When we sort our original users by where their cohort IDs fall in the Gray code sequence, we get back to only four bit flips, and if I were more mathy I could probably write a proof showing that sorting by Gray code order is optimal.
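My reading of how to apply this (an assumption on my part rather than something spelled out in the paper) is to treat each cohort ID as a Gray code, decode it back to its position in the Gray sequence, and sort by that position.  On our six users that recovers the four-flip ordering:

object GrayCodeSort {
  // Encode: integer n -> Gray code (the Rosetta Code one-liner from above).
  def toGray(n: Int): Int = n ^ (n >>> 1)

  // Decode: treat a cohort ID as a Gray code and recover its position in the
  // Gray sequence. Sorting by this position keeps similar IDs next to each other.
  def grayRank(id: String): Int = {
    var g = Integer.parseInt(id, 2)
    var n = 0
    while (g != 0) { n ^= g; g >>>= 1 }
    n
  }

  def totalFlips(ids: Seq[String]): Int =
    ids.sliding(2).collect { case Seq(a, b) => a.zip(b).count { case (x, y) => x != y } }.sum

  def main(args: Array[String]): Unit = {
    // The 3-bit Gray sequence for reference: 000, 001, 011, 010, 110, 111, 101, 100.
    println((0 until 8).map(toGray))            // Vector(0, 1, 3, 2, 6, 7, 5, 4)

    val users  = Seq("001", "001", "011", "111", "101", "100")
    val sorted = users.sortBy(grayRank)
    println(sorted)                             // List(001, 001, 011, 111, 101, 100)
    println(totalFlips(sorted))                 // 4
  }
}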