Synthetic dataset of human trafficking victims could enable big data work without privacy compromises - Start Up Gazzete

To combat human trafficking effectively, those who fight it must understand it, and these days, that means data. Unfortunately, for obvious reasons there is no convenient index of trafficked persons, although confidential information of this kind is fairly abundant. Microsoft and the International Organization for Migration may have found a way forward with a new synthetic database that has all the important characteristics of real trafficking data but is completely artificial.

While each victim is unquestionably an individual, the basic high-level questions, such as which countries are increasingly the source or conduit of trafficking, what routes and methods are used, and where the victims end up, are a matter of statistics. The evidence for identifying trends and patterns, crucial for prevention, is locked in thousands of these individual stories that most would rather not make public.

“Administrative data on identified human trafficking cases represents one of the main sources of data available, but such information is highly sensitive,” said IOM program coordinator Harry Cook in a press release describing the dataset. “IOM has been delighted to work with Microsoft Research over the past two years to advance the critical challenge of sharing such data for analysis, while protecting the safety and privacy of victims.”


Historically, for things like crime databases and medical information, the strategy has been to redact liberally, but this “de-identification” method has proven ineffective against any serious attempt to reconstruct the data. With numerous leaked and public databases available and computing power on tap, redacted information can be filled in quite reliably.

The option adopted by Microsoft Research is to use the original data as the basis for a synthetic data set that preserves all the important statistical relationships from the source, but none of the identifiable information. And it’s not just about turning “Jane Doe” into “Janet Doeman” and her hometown from Cleveland to Queens. Instead, groups of no fewer than 10 people with similar or overlapping data are merged to create a set of attributes that represents them accurately in statistical terms, but cannot be used to identify any of them individually.
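To make the idea of a minimum group size concrete, here is a minimal sketch in Python. It is not Microsoft Research's actual method, which the article does not detail. It assumes a much simpler scheme: attribute combinations shared by fewer than 10 records are dropped outright rather than merged, and synthetic rows are then sampled from the remaining combinations in proportion to how often they occurred. The field names (`origin`, `type`), the sample values, and the function name `synthesize` are hypothetical.

```python
import random
from collections import Counter

K = 10  # minimum group size, taken from the article's "no fewer than 10 people"


def synthesize(records, k=K, seed=0):
    """Return synthetic rows built only from attribute combinations
    shared by at least k source records (a simplified assumption)."""
    # Count how many source records share each exact combination of attributes.
    counts = Counter(tuple(sorted(r.items())) for r in records)
    # Discard combinations rare enough to point at a small group of people.
    common = {combo: n for combo, n in counts.items() if n >= k}
    if not common:
        return []
    rng = random.Random(seed)
    combos = list(common)
    weights = [common[c] for c in combos]
    # Sample in proportion to the retained counts, so the overall
    # frequencies are approximately preserved without copying any row.
    return [dict(c) for c in rng.choices(combos, weights=weights, k=sum(weights))]


# Hypothetical toy data: two common profiles and one rare profile.
records = (
    [{"origin": "A", "type": "labour"}] * 12
    + [{"origin": "B", "type": "sexual"}] * 10
    + [{"origin": "C", "type": "labour"}] * 3
)
synthetic = synthesize(records)
print(len(synthetic))  # 22: only the two common profiles contribute rows
```

In this toy run the three records from origin "C" never appear in the output, because a group that small could single out individuals. The cost is the loss of granularity the article goes on to mention.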

Naturally this does not have the granularity of the original data, but unlike the sensitive source, this data can actually be used. It is not necessarily meant for some working group to analyze and say “okay, the next smuggling operation will be based on …” Rather, this data, based on firsthand evidence, can be pointed to as a factual record in order to address the problem at the level of policy and diplomacy. Where before one may have had to say more generally that Country X or Government Z was negligent or complicit in these matters, having hard data to back that up allows you to say, “36 percent of sex trafficking victims go through your jurisdiction.”

Joshua Smith
