Some thoughts on the Myki data leak

This, the other week, was interesting:

In a concerning revelation, researchers have found that myki, in conjunction with social media, can be used to uncover a wealth of information about card users.

ABC: ‘Shocking’ myki privacy breach for millions of users in data release

Here’s the report and media release from the Office of the Victorian Information Commissioner:

Information Commissioner investigates breach of myki users’ privacy

Here’s the original study:

Two data points enough to spot you in open transport records

What happened was that PTV released a whole bunch of Myki touch on/off data for a “datathon” event, where people see what handy things they can do with the data.

It was “de-identified” – that is, Myki card numbers were removed and replaced with another identifier, which could link trips from a single card together, but not back to a card holder.

Or so they thought.

Part of the problem was they left in a flag indicating the card type. This is not just Full Fare (Adult) or Concession – it goes down to the precise type of Concession or free pass. For instance type 39 is a War Veterans Travel Pass; type 46 is a Federal Police Travel Pass.

With more than 70 types of card, some of the more obscure types are pretty rare, so if the person you’re trying to track down is using one of them, they’re probably not that hard to find, particularly if you know which stations they regularly use.
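To make that concrete, here's a rough sketch of the sort of check that might flag the risk before a release: count how many distinct (de-identified) cards carry each card type, and see how few candidates are left once you also know a station. The record layout, the function names and the sample rows are all made up for illustration. Type 39 is borrowed from the example above, and type 1 is just a stand-in for a common card type. This is not the researchers' method or PTV's actual data format.

```python
from collections import defaultdict

# Hypothetical touch records: (pseudonymous card id, card type code, stop).
# All rows are invented for illustration.
touches = [
    ("a", 1, "Rosanna"),
    ("b", 1, "Rosanna"),
    ("c", 1, "Ivanhoe"),
    ("d", 39, "Rosanna"),
    ("d", 39, "Parliament"),
    ("e", 1, "Parliament"),
]


def rare_card_types(records, k=5):
    """Return {card type: number of distinct cards} for types held by fewer than k cards."""
    cards_by_type = defaultdict(set)
    for card_id, card_type, _stop in records:
        cards_by_type[card_type].add(card_id)
    return {t: len(cards) for t, cards in cards_by_type.items() if len(cards) < k}


def candidates(records, card_type, stop):
    """Return the pseudonymous cards of a given type that touched at a given stop."""
    return {cid for cid, ctype, s in records if ctype == card_type and s == stop}


print(rare_card_types(touches, k=2))
print(candidates(touches, 39, "Rosanna"))
```

In this toy data only one card has type 39, so knowing the card type and a single station is enough to pick it out. Dropping the card type column, or collapsing it into broad groups, would be the obvious mitigation.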

That’s presumably how the researchers found Anthony Carbines, State MP for Ivanhoe (who I’m guessing travels on a State Parliamentarian Travel Pass): by looking at the data and matching it up with his social media posts, which included at least one from Rosanna Station.
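The general idea of matching public sightings against the records can be sketched as a simple set intersection: each known place and day narrows down the set of cards that could belong to the person. Again, this is only an illustration under assumed conditions. The record layout, the function name, the dates and the sample rows are invented, and the researchers' actual matching would have been more involved.

```python
# Hypothetical touch records: (pseudonymous card id, stop, date).
# All rows are invented for illustration.
touches = [
    ("a", "Rosanna", "2018-03-01"),
    ("b", "Rosanna", "2018-03-01"),
    ("a", "Parliament", "2018-03-09"),
    ("c", "Parliament", "2018-03-09"),
    ("b", "Ivanhoe", "2018-03-09"),
]


def cards_matching_sightings(records, sightings):
    """Return the cards that have a touch at every (stop, date) sighting."""
    remaining = None
    for stop, day in sightings:
        seen = {cid for cid, s, d in records if s == stop and d == day}
        # Keep only cards consistent with all sightings so far.
        remaining = seen if remaining is None else remaining & seen
    return remaining if remaining is not None else set()


one = cards_matching_sightings(touches, [("Rosanna", "2018-03-01")])
two = cards_matching_sightings(
    touches, [("Rosanna", "2018-03-01"), ("Parliament", "2018-03-09")]
)
print(len(one), len(two))
```

In this toy data, one sighting leaves two candidate cards and a second sighting leaves just one. Once a single card remains, every other trip on that card is exposed too.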

I’m probably in there too. And so are you. (I’ve only seen a sample of the data; a mere 30 million card touch records out of the total 1.8 billion originally released.)

Myki machines at Southern Cross

Ultimately, it’s good that data sets like this are released. There should actually be a lot more of them – at present, the data released by PTV is very limited. Anything related to patronage or bus service performance is really difficult to find.

Perhaps the reason the data wasn’t adequately cleaned is that they’re out of practice. Almost everything currently available either has nothing to do with passengers directly, or is at such a high level that it could never be used to find individuals.

More data should be out there. Ultimately, the public transport network is funded by taxpayers, and it should be a lot more accountable and transparent than it is.

One thing’s for sure: if they have a go at releasing this level of detailed data again – and I hope they do – they’ll need to be more careful to remove information that could be used to re-identify individuals.

6 thoughts on “Some thoughts on the Myki data leak”

  1. Gary

    You’re 100% right. There is a lot of good in releasing data for these hackathons, but cleansing the data is critical. There are some well-known principles to follow, and many companies are highly skilled in this area, especially in health.

  2. Peter

    There is a lot of good from releasing the myki data not just for “hackathons” but for use by the general public, in the public interest.

  3. Is there any detail on what the results of the hackathon event were? I know NSW has also run these in the past to encourage development of services.

  4. enno

    If you thought myki could be used to analyse bus patronage, you might be disappointed. My surveys in the far south suggest 70% of bus users don’t have a functioning myki.

  5. Thanks Daniel. The researchers’ paper is available as a free download via https://arxiv.org/abs/1908.05004. Interestingly, they point out that while it was possible to re-identify Mr Carbines from a single tweet using the card type, the absence of this special information only meant that three tweets over three years, rather than a single instance, were required to uniquely spot him in the records.

    I’d echo Daniel’s point: the government keeps far too many secrets regarding the way our public transport system is used and operated. The PTUA has campaigned for years on reforming this secretive culture and ensuring the public, who fund the whole thing, have the right to know more about how it’s working. It’s certainly fair to ask whether the government is simply out of practice when it comes to openness with the public.

    But this misstep does point to a deeper issue. The researchers’ paper stresses a crucial distinction between ‘open government’ and ‘open data’. This particular data release provided relatively little transparency about the public transport system itself—given many people apparently obtained incomplete data sets, so couldn’t even reliably assess aggregate patronage levels—but an awful lot of transparency about the travelling public.

    Only since Myki has it even been possible to build up a longitudinal picture of individuals’ travel patterns. It’s not something advocates particularly asked for, and it appears all the best public transport systems in the world have managed to achieve what they have without access to this level of information about individual travellers. That stands to reason, given a fundamental principle of public transport (as with roads, and many other kinds of public infrastructure) is that as a universal service it should respond primarily to people’s needs in the aggregate and not try too hard to tailor itself to individuals’ circumstances.

    Like Daniel, I’m hoping the blowback from this data dump won’t become an excuse for the government not to release more of the kind of detailed operational data that allows the public to be informed about its transport system and how the government is managing it. But I’m not convinced that simply because the government can collect certain kinds of information it necessarily should in every instance.

  6. malcolm

    seems the dataset has already been pulled, sadly … I regularly download my trip data for similar reasons, so I was interested in taking a look …
