Training an AI: Is your data fair game?

Deep learning, the cutting edge of today’s AI technologies, allows a machine to replicate what a human can do, but at greater scale and speed. It does this by processing thousands or even millions of examples of a task performed correctly by humans, getting successively better until it is as good as a human. This process is known as training, and the algorithm is only as good as the data it is trained on. But where does the data come from? Does a company need to secure the rights to the data used in training? Could a company be using your private data to train its systems today? Hint: the answer is yes.
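To make the idea concrete, here is a minimal sketch of that training loop in Python. It uses a tiny logistic regression model on synthetic data as a stand-in for a real deep learning system; production models and datasets are vastly larger, but the cycle is the same: predict, compare against human-provided labels, adjust, repeat.

```python
# A minimal sketch of supervised training: logistic regression fitted by
# gradient descent on synthetic data. It stands in for a real deep
# learning system; everything here is toy-sized, but the loop is the
# same: predict, compare against human-provided labels, adjust, repeat.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "labeled" data: 1,000 examples with 2 features each.
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # stand-in for human labels

w = np.zeros(2)  # model weights, adjusted as training progresses
b = 0.0
lr = 0.1         # learning rate: how far to adjust on each step

for step in range(500):
    # Predict: squash a weighted sum into a probability between 0 and 1.
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    # Measure the error against the labels and nudge weights to reduce it.
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
print(f"training accuracy: {np.mean((p > 0.5) == y):.2%}")
```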

To better understand what is involved in the training process, let’s take a look at an example. Facebook has a feature called “Tag Yourself,” which scans every picture that is uploaded, identifies whether you are in the picture, and asks if you would like to tag yourself. Facebook is using a deep learning system to recognize your face. To train the algorithm, Facebook needed to successively feed in hundreds of thousands (or more!) of examples of correctly identified faces to let the algorithm get better and better at recognizing them. This training process takes a great deal of computing power and involves a lot of mathematics, but the end result is a model that can be very fast at identifying new faces (including ones that were not part of the training data).
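As a rough illustration of that end result: once training has produced a model that maps a face image to a numeric “embedding,” recognizing a new face can be as simple as a fast nearest-neighbor comparison. The sketch below assumes a hypothetical embed_face() function standing in for the trained model; the names and threshold are illustrative, not Facebook’s actual system.

```python
# A simplified sketch of how a trained face model might be used at tag
# time. Training produces an "embedding" model that maps a face image to
# a vector; recognizing a new face is then a fast nearest-neighbor
# lookup. embed_face() is a hypothetical stand-in for that trained model
# and returns random vectors here purely for illustration.
import numpy as np

rng = np.random.default_rng(1)

def embed_face(image) -> np.ndarray:
    """Hypothetical trained model: face image -> 128-dim embedding."""
    return rng.normal(size=128)  # placeholder output, not a real model

# Embeddings for faces users have already tagged, computed once and stored.
known_users = ["alice", "bob", "carol"]
known_embeddings = np.stack([embed_face(None) for _ in known_users])

def suggest_tag(new_image, threshold: float = 0.6):
    """Match a new face against known users by cosine similarity."""
    v = embed_face(new_image)
    sims = known_embeddings @ v / (
        np.linalg.norm(known_embeddings, axis=1) * np.linalg.norm(v)
    )
    best = int(np.argmax(sims))
    # Only prompt "would you like to tag yourself?" on a confident match.
    return known_users[best] if sims[best] >= threshold else None

print(suggest_tag(None))  # with placeholder embeddings, likely None
```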

So, what is needed for a high quality algorithm is a lot of high quality training data, specific to the task at hand. But where does this training data come from? A computer cannot create it on its own; it has to be created and labeled by people. And when you need hundreds of thousands of correctly categorized pictures, or in other cases correctly categorized pieces of text, creating or curating that information can be very costly for a company. Many companies, including Facebook, outsource the work to cheaper human labor, as described in a Verge article titled “Facebook’s contract workers are looking at your private posts to train AI:”

Facebook confirmed to Reuters that the content being examined by Wipro’s workers includes private posts shared to a select number of friends, and that the data sometimes includes users’ names and other sensitive information. Facebook says it has 200 such content-labeling projects worldwide, employing thousands of people in total. – The Verge

Facebook is building AI functionality to improve its product, and it needs to train its AI systems to better interpret the posts users make. But in order to get high quality training data for a robust deep learning algorithm, it is letting outsourced workers read public and private posts that users have put on the system. From Facebook’s point of view, it needs to use actual users’ posts as training data in order to best tailor its AI algorithms to its users.
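Concretely, what those content-labeling workers produce is just pairs of raw content and a human-chosen label. Here is a hypothetical sketch of what such records might look like; the field names and categories are made up for illustration, not any real Facebook schema.

```python
# A hypothetical sketch of human-labeled training records in a
# content-labeling project. Field names and categories are illustrative
# only, not any real Facebook schema.
labeled_examples = [
    {"text": "Just got engaged!!",          "label": "life_event"},
    {"text": "Selling my bike, $200 obo",   "label": "for_sale"},
    {"text": "Anyone else's app crashing?", "label": "other"},
]

# Every record had to be read and categorized by a person; at hundreds of
# thousands of records, that human effort is the dominant cost.
for example in labeled_examples:
    print(f"{example['label']:>10}: {example['text']}")
```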

But is this OK? It’s almost certain that users did not intend or expect that their data would be used in this way – sophisticated deep learning systems that require this kind of training are a recent development. Facebook appears to be operating in a gray area, and we have not yet had a real conversation about whether the practice should be acceptable. Until that debate produces an industry consensus, or the government steps in to regulate data usage, it is up to each of us as individuals to understand what is happening and decide for ourselves what is or is not acceptable.