Open Data Needs HTTPS Too

published by Eric Mill on

Recently, Tim Berners-Lee (inventor of the web and President of the Open Data Institute) informally asked a W3C working group to consider relaxing "mixed-content" restrictions on fetching insecure data in web browsers, apparently in order to support "open data mashups".

Berners-Lee supports a secure web, and he argued that loosening the rules would ease the web's transition to HTTPS. However, loosening mixed-content restrictions for open data is technically unworkable [1], and the proposal is unlikely to go far.

But his proposal has a clear and (sadly) correct implication: a whole lot of open data is only available over an insecure connection.

Which is too bad, because open data especially needs HTTPS. From a response I posted on GitHub:

Any data that's important enough to call "open data", that we think there's value in people, businesses, and civil society depending on -- is important enough that we should demand that it's provided over a secure and private connection.

This is why Gov.UK requires HTTPS for government services, and why 18F will only build .gov sites that use HTTPS. It's why Sunlight ensures its legislative APIs are encrypted, and why HTTPS is increasingly the priority of the open data community.

"Open data" is a broad term, covering services and information that touch every part of our lives. No ISP, cafe, or government should be able to use HTTP requests for open data to identify someone's location, profile their health, or track their browsing activity -- or to sell the metadata to someone who will. Open data on the web has to serve the public's interest, and HTTPS is the most basic way to achieve that.

So why are so many of the leaders of the open data community so far behind?

  • Code for America sets a bad example for its fellows and brigades by not using HTTPS on its website or its API.
  • OpenCongress sets a terrible example by literally transmitting passwords in the clear during account creation (and, until I badgered them for months, during every login). What's mystifying is that OpenCongress already supports HTTPS, and chooses not to simply force it. Update: OpenCongress now forces HTTPS across the board.
  • Similarly, the homepages for the Sunlight Foundation, the Open Data Institute, and are all HTTPS-ready, but don't force it -- and so Google leads all of their visitors to the HTTP version. You'd only know they supported it by typing in the 's' yourself. Update: and the Sunlight Foundation now each enforce HTTPS for their entire websites.
  • Anyone who scans the complete list of .gov domains will find that the overwhelming majority of .gov domains are served over insecure connections.
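
For sites that already support HTTPS but don't force it, "forcing" usually comes down to two things: redirecting every plain-HTTP request to its https:// equivalent, and telling browsers to stick with HTTPS on later visits. Here is a minimal sketch of that idea as a Python WSGI wrapper. It is an illustration, not how any of the sites above actually did it: the function name `force_https`, the one-year `max-age`, and the demo app are all assumptions, and it assumes the server reports the real scheme in `wsgi.url_scheme` and runs on the default ports (a site behind a proxy or load balancer would need to check whatever header that proxy sets instead).

```python
from urllib.parse import quote

# Illustrative value: ask browsers to remember HTTPS for one year.
HSTS_VALUE = "max-age=31536000"


def force_https(app):
    """Wrap a WSGI app so plain-HTTP requests are redirected to HTTPS.

    Assumes the server sets wsgi.url_scheme accurately and that the site
    is served on the default ports (80 and 443).
    """
    def wrapper(environ, start_response):
        if environ.get("wsgi.url_scheme") != "https":
            host = environ.get("HTTP_HOST", "localhost")
            path = quote(environ.get("SCRIPT_NAME", "") + environ.get("PATH_INFO", ""))
            query = environ.get("QUERY_STRING", "")
            location = "https://" + host + path + ("?" + query if query else "")
            start_response("301 Moved Permanently",
                           [("Location", location), ("Content-Length", "0")])
            return [b""]

        # Already on HTTPS: pass the request through, adding the HSTS header.
        def https_start_response(status, headers, exc_info=None):
            headers = list(headers) + [("Strict-Transport-Security", HSTS_VALUE)]
            return start_response(status, headers, exc_info)

        return app(environ, https_start_response)

    return wrapper


# Hypothetical demo app, only here to show the wrapper in use.
def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"open data, served securely\n"]


application = force_https(demo_app)
```

A redirect like this is the easy part for a website. For an API it can still break clients that don't follow redirects, which is why the transition takes more care there.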

To be clear, I understand that transitions can be work, especially for APIs. When I moved Sunlight's Congress API to HTTPS, we held off on forcing a redirect while I set up analytics to track use of encryption, scrambled around GitHub updating popular clients, and eventually emailed users directly to warn them to switch. It cost me time and energy, and owners of popular services are understandably risk-averse.

But plenty of other open data services have found the time to do it. GovTrack and CourtListener moved in 2013, and Carl Malamud has been going out of his way to obtain only the most top-notch certificates for many years.

And frankly, it's just the way the internet is moving.

In 2013, we got leaks about the NSA and GCHQ exploiting their access to internet backbones to correlate and copy wholesale as much internet traffic as they possibly can. In 2014, we got Comcast and Verizon injecting ads and tracking beacons into their customers' internet use.

Those who love and care for the internet agree: those are all attacks, and they affect all of us. Strong HTTPS makes those attacks impossible to pull off at global scale. So in 2015, I hope we see the open data community tighten up its wires with the rest of us.

1. Mixed content is when you visit a website over https://, and it tries to load images or fonts over insecure http://. Browsers today will let images by with a warning, but will block fonts, scripts, and other forms of "active" data. Without this protection, the promise of a website's https:// doesn't mean very much, because pulling in insecure assets could completely compromise the security of the website.
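
To make the footnote concrete, here is a deliberately simplified Python model of the rule as described above. It is not the algorithm any browser actually implements: the function name `mixed_content_decision`, the resource "kinds", and the example URLs are all made up for illustration, and only images are treated as passive, following the description in the footnote.

```python
from urllib.parse import urlparse

# Simplification taken from the footnote: images get a warning,
# everything else counts as "active" and is blocked.
PASSIVE_KINDS = {"image"}


def mixed_content_decision(page_url, resource_url, kind):
    """Return "allow", "warn", or "block" for a resource loaded by a page."""
    page_is_secure = urlparse(page_url).scheme == "https"
    resource_is_insecure = urlparse(resource_url).scheme == "http"

    # Mixed content only exists when a secure page pulls in an insecure resource.
    if not (page_is_secure and resource_is_insecure):
        return "allow"
    return "warn" if kind in PASSIVE_KINDS else "block"


# Hypothetical example: an HTTPS mashup pulling from an HTTP-only data source.
page = "https://mashup.example/"
print(mixed_content_decision(page, "http://data.example/logo.png", "image"))
print(mixed_content_decision(page, "http://data.example/budget.json", "fetch"))
print(mixed_content_decision(page, "https://data.example/budget.json", "fetch"))
```

In this model the second call is the "open data mashup" case: a script on a secure page fetching data over http:// counts as active content and gets blocked, which is why HTTP-only data sources are a problem for HTTPS sites.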