Most of us should be familiar with Chrome's Safe Browsing feature, which warns if you're visiting a page that might be dangerous. It's not like most of us are actively seeking out malware or phishing sites, but once in a while, some link on Reddit, an email, or some deep Search rabbit hole takes you to an unsavory place, and Chrome lets you know it might not be a good idea to proceed. I never really thought about it very deeply, but I always assumed that the system worked because Google knew through Chrome which pages I was visiting and kept an eye out based on a list. That's partly true, but it misses one critical and interesting fact: The Safe Browsing system actually doesn't tell Google which pages you're on, preserving your privacy just a little more.

I should stress that this discussion fully ignores the advertising boogeyman, which is another very valid avenue for privacy infringement. And I'm not saying that other systems like your tab-syncing browsing history don't complicate the discussion of what Google knows and doesn't know. But Safe Browsing isn't just streaming a continuous list of the URLs you visit to Google's servers, which is how I would have assumed this feature might work. In fact, according to a recently published blog post describing the APIs and systems used, Chrome maintains a local list on your device to compare against. But this isn't just a list of URLs.

Intuitive as that might seem, it would quickly become storage and resource-intensive — Google's protecting us from more pages than you might think. Instead, your phone or computer maintains a list of so-called "hashes," which are cryptographically generated semi-unique strings of letters and numbers created from each unique URL.
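To make the idea of a hash concrete, here is a minimal sketch in Python. It assumes SHA-256 as the hash function, which is what Google's public Safe Browsing documentation describes but which this article doesn't spell out. The `full_hash` helper and the example strings are made up for illustration.

```python
import hashlib

def full_hash(url_expression: str) -> bytes:
    """Return the 32-byte SHA-256 digest of a URL expression (assumed hash function)."""
    return hashlib.sha256(url_expression.encode("utf-8")).digest()

# Two nearly identical inputs produce completely unrelated digests.
print(full_hash("example.com/").hex())
print(full_hash("example.com/a").hex())
print(len(full_hash("example.com/")))  # 32 bytes, i.e. 256 bits
```

The digest is always the same length regardless of how long the URL is, which is part of why a list of hashes is easier to store and compare than a list of raw URLs.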

Google has built workarounds into the system's logic so that a subtle change to the URL won't easily circumvent the security: malicious actors can't evade detection just by making random tweaks to the subdomain or sub-pages. The lists are also updated about every thirty minutes. Furthermore, even full hashes would make the list too big past a point, so the local list actually only stores a truncated version of each hash: its first four bytes (32 bits).
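One way to picture those workarounds is that the browser checks several variations of each URL rather than just the exact string. The sketch below is a simplified approximation of the host-suffix and path-prefix scheme in Google's public Safe Browsing documentation, not Chrome's actual code. It skips the real canonicalization rules, and the function names are hypothetical. It also truncates each SHA-256 digest to a 4-byte (32-bit) prefix, the size that lines up with the 4.29 billion figure.

```python
import hashlib
from urllib.parse import urlsplit

def host_suffixes(host: str) -> list[str]:
    """Exact host, then shorter suffixes, never going below two labels."""
    labels = host.split(".")
    suffixes = [host]
    start = max(1, len(labels) - 5)  # consider at most the last five labels
    for i in range(start, len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate not in suffixes and len(suffixes) < 5:
            suffixes.append(candidate)
    return suffixes

def path_prefixes(path: str, query: str) -> list[str]:
    """Exact path (with and without query), then directory prefixes from the root."""
    prefixes = [f"{path}?{query}"] if query else []
    prefixes.append(path)
    current, candidates = "/", ["/"]
    for part in [p for p in path.split("/") if p][:-1]:
        current += part + "/"
        candidates.append(current)
    for candidate in candidates[:4]:
        if candidate not in prefixes:
            prefixes.append(candidate)
    return prefixes

def expressions(url: str) -> list[str]:
    """Every host-suffix/path-prefix combination that gets checked for one URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return [h + p for h in host_suffixes(host)
            for p in path_prefixes(parts.path or "/", parts.query)]

def hash_prefix(expression: str, length: int = 4) -> bytes:
    """First `length` bytes of the SHA-256 digest (4 bytes = 32 bits)."""
    return hashlib.sha256(expression.encode("utf-8")).digest()[:length]

for expr in expressions("http://a.b.example.com/1/2.html?param=1"):
    print(expr, hash_prefix(expr).hex())
```

Because `example.com/` is among the expressions generated for `a.b.example.com/1/2.html`, flagging the bare domain also catches any subdomain or sub-page an attacker invents underneath it.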

That's still enough unique data that most safe URLs won't be accidentally caught, but some still will: Non-computer-scientists should note that four bytes (32 bits) are enough to store up to 4.29 billion unique identifiers, but you can still run into "collisions," where two different pages generate the same truncated hash at that length. That's problematic, as it means Chrome might recognize a good page as a potentially known bad one.
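The 4.29 billion figure is simply the number of values 32 bits (four bytes) can take. As a rough back-of-the-envelope sketch, the snippet below also estimates how often a random safe URL would collide with some entry on the local list. The list size used here is a made-up placeholder, not a number from Google, and the estimate assumes prefixes are spread uniformly.

```python
PREFIX_BITS = 32  # four bytes

# Number of distinct values a 32-bit prefix can take.
prefix_space = 2 ** PREFIX_BITS
print(prefix_space)  # 4294967296

def false_positive_chance(list_size: int, prefix_bits: int = PREFIX_BITS) -> float:
    """Chance that one random prefix matches at least one of `list_size` listed prefixes."""
    return 1 - (1 - 1 / 2 ** prefix_bits) ** list_size

# Hypothetical list size, chosen only for illustration.
print(f"{false_positive_chance(1_000_000):.6f}")
```

Even a small per-URL chance adds up across the enormous number of pages and embedded resources people load, which is why a prefix match alone can't be treated as a verdict.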

[Animation: a more visual representation of how Safe Browsing's hash-comparison system works. Image source: Google.]

There are over 1.88 billion websites now, and even old estimates place the number of unique pages indexed by Google in the trillions, so Safe Browsing is bound to produce a few false positives if it only uses this system. Thankfully, there's another step. When Chrome does find that a partial hash of a page matches the corresponding partial hash lists for malware or adware or what have you, it asks Google for all the full hashes that correspond to that partial hash, so it can make a local comparison.

That means your browser is then able to make a second judgment call locally where Google can't see it, comparing a much longer hash identifier for the site you're visiting against Google's much bigger list while avoiding the headache of actually storing that big list fully locally, where it would take up a ton of space. That also means that Google's servers and APIs dealing with this Safe Browsing feature don't actually know precisely which page you're on: the comparison happens on-device.
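The two-step check can be sketched as a small simulation. Everything here is an illustrative assumption: `FakeServer` stands in for Google's side, the URL strings are invented, and SHA-256 with a 4-byte prefix is assumed from Google's public documentation rather than taken from Chrome's real implementation.

```python
import hashlib

def sha256(expression: str) -> bytes:
    return hashlib.sha256(expression.encode("utf-8")).digest()

class FakeServer:
    """Hypothetical stand-in for the server, holding full hashes of bad URLs."""

    def __init__(self, bad_expressions):
        self.full_hashes = {sha256(e) for e in bad_expressions}
        self.seen_prefixes = []  # everything the server ever learns from the client

    def prefixes(self, length: int = 4) -> set:
        """The compact list the client downloads and stores locally."""
        return {h[:length] for h in self.full_hashes}

    def full_hashes_for(self, prefix: bytes) -> set:
        """Step two: return every full hash that starts with the given prefix."""
        self.seen_prefixes.append(prefix)
        return {h for h in self.full_hashes if h.startswith(prefix)}

def is_flagged(expression: str, local_prefixes: set, server: FakeServer) -> bool:
    digest = sha256(expression)
    prefix = digest[:4]
    if prefix not in local_prefixes:
        return False  # step one: no local match, so no request is sent at all
    # Step two: fetch candidates, then compare the full hash on the device.
    return digest in server.full_hashes_for(prefix)

server = FakeServer(["malware.example/", "phish.example/login"])
local = server.prefixes()

print(is_flagged("malware.example/", local, server))  # True
print(is_flagged("news.example/", local, server))     # False
print([p.hex() for p in server.seen_prefixes])        # only 4-byte prefixes
```

Note what ends up in `seen_prefixes`: the server only ever receives a 4-byte prefix, never the URL or its full hash, and it receives nothing at all for pages with no local match.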

Arguably Google's servers still kind of know which sites you might be visiting, since it could be one of those on that newer and much smaller list, but that's not guaranteed either. Because Chrome runs this check continuously for all resources on every page you visit, it could be that you're navigating to an arguably "okay" base site, but the page included embeds, images, etc., that point to dangerous places. And this is still a greater degree of privacy preservation than a URL-based browsing security system would otherwise have.

The best possible solution from a privacy-first perspective would be to store the full and complete list locally all the time, but since that's infeasible for resource reasons, this is the second-best approach, and it still partially obfuscates which pages you're visiting from Google's Safe Browsing system.

Maintaining privacy on the internet is pretty hard, and this arguably isn't a perfect solution, but it's a decent and privacy-protecting way to implement a URL-based malware detection system. Now we just have to make sure the ad systems, cookies, third-party libraries, and every bit of software on the innumerable machines between you and the websites you visit are also doing their part to respect you, too.