Privacy and security of ML is very different from privacy and security using ML, and this blog tries to shed some light on the overlapping and disjoint components of the two sides of machine learning.
The article will not go deep into every aspect of the topic, but rather point readers in a direction so they can build an understanding and a perspective of how ML can be used for privacy and security while carrying its own inherent biases and shortcomings.
What is the difference between privacy and security? For the context of this article, I would like to take a toy example and try to give some perspective on what the two words mean.
You have a secret document containing your sensitive information, and you would not like a threat actor to have access to it, so you place it safely inside a small password-locked safe. If the threat actor is able to break in and gain access to your document, there is a security breach. If the document is then read and the actor is able to understand what is written inside, it becomes a privacy concern. But what if the information in the document was gibberish, or in a language the actor does not understand…
As you may have imagined by now, security and privacy also call for different safeguarding approaches. Security measures include limited resource allocation and other traditional defenses, whereas privacy relies on techniques such as encryption.
Using ML for security is the more common and commercial side of machine learning. It primarily involves training models and using ML algorithms to automate endpoint security. There are many examples of ML being used to detect attacks. We can use algorithms such as naive Bayes to distinguish legitimate email from the spam shown in the spam folder, and we use various IDS and IPS products that build on ML techniques, where the goal is to separate a normal connection from a malicious one. Malicious connections can enable a whole class of further security attacks such as DDoS, botnets used for further escalation, and user-to-root (U2R) and remote-to-local (R2L) attacks.
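As a toy sketch of the spam-filtering idea above, here is a multinomial naive Bayes classifier written from scratch. The tiny corpus, the word counts, and the example messages are all made up for illustration; a real filter would train on thousands of labeled emails and use a proper library.

```python
# Toy multinomial naive Bayes spam filter (illustrative sketch, not production code).
# The four training messages below are invented purely for demonstration.
from collections import Counter
import math

train = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch with the team", "ham"),
]

# Count word frequencies per class, plus how many documents each class has.
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    """Return the class with the highest log-posterior, with add-one smoothing."""
    scores = {}
    total_docs = sum(class_counts.values())
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)  # log prior
        for w in text.split():
            # add-one (Laplace) smoothed log likelihood
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("claim your free money"))   # → spam
print(predict("agenda for the meeting"))  # → ham
```

The same decision rule, with a larger vocabulary and better feature engineering, is what classic spam filters build on.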
Besides this, various regulation and compliance standards require us to monitor packets of data, perform log analysis, and maintain network segmentation, all of which can be achieved with ML methods rather than traditional approaches alone.
The privacy and security of ML itself is where a lot of recent funding, effort, and research has been going. This area deals with the issues that arise when we want machine learning algorithms themselves to come with privacy and security guarantees. The data on which models are trained is often sensitive in nature, and there are compliance rules for MLaaS and other cloud and offline providers on how to make sure the training data does not leak sensitive information even when queries are fired at the model via an API. However, in recent years various studies have shown that the family of attacks broadly called "adversarial ML" can extract and recover large portions of the training data. This is not just a privacy concern: having access to the training data also means an attacker can mimic the model's behavior and expected output. The MNIST and FMNIST datasets, for example, have been shown to be prone to such attacks.
The work of Ian Goodfellow et al. on adversarial examples showed that a panda can be classified as a gibbon with more confidence than it is classified as a panda; this was essentially the start of a whole array of ML attack techniques. An attacker can also poison the training data over time so that the test data will always yield wrong results. These attacks were extended to self-driving cars, where the adversary placed perturbations over a STOP sign and essentially spoofed the system. We have also seen model stealing attacks, where entire ML models were replicated offline purely through API queries to an MLaaS provider, defeating the pay-per-query structure. This also gives rise to corporate espionage, given the hours and money it takes to train such models.
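The model stealing attack mentioned above can be sketched in a few lines. The "API" here is a stand-in: a hidden one-dimensional threshold rule the attacker cannot inspect, which they reconstruct purely from query answers. Real extraction attacks (as in the Tramèr et al. paper referenced at the end) target far richer models, but the economics are the same.

```python
# Sketch of a model-extraction attack against a black-box "pay-per-query" API.
# The hidden model below is invented for illustration.

def blackbox_predict(x):
    """Hidden model: the attacker only ever sees labels, never these weights."""
    secret_w, secret_b = 2.0, -3.0
    return 1 if secret_w * x + secret_b > 0 else 0

# The attacker sends chosen queries and records the answers...
queries = [i / 10 for i in range(0, 50)]
labels = [blackbox_predict(x) for x in queries]

# ...then fits a surrogate by locating where the label flips from 0 to 1.
boundary = next(x for x, y in zip(queries, labels) if y == 1)

def surrogate_predict(x):
    return 1 if x >= boundary else 0

# The surrogate now agrees with the paid API everywhere it was probed, for free.
agreement = sum(surrogate_predict(x) == blackbox_predict(x) for x in queries) / len(queries)
print(agreement)  # → 1.0 on the queried range
```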
Adversarial ML builds on the idea that the gradient of the loss function with respect to the input, under a perturbation constraint measured in some norm or statistical distance, points in a direction that can be followed to spoof the system.
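To make the gradient idea concrete, here is a minimal sketch of the fast gradient sign method (FGSM, from Goodfellow et al.) against a hand-rolled logistic regression model. The weights, input point, and epsilon are made up, and the attacker is assumed to know the model (the white-box setting); real attacks apply the same step to deep networks over images.

```python
# Minimal FGSM sketch against a two-feature logistic regression model.
# Weights and the input are invented for illustration (white-box attacker).
import math

w = [2.0, -1.0]  # model weights, known to the attacker
b = 0.0

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def predict_proba(x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

x = [0.3, 0.1]  # clean input, true label y = 1
y = 1
p = predict_proba(x)  # > 0.5, so the model gets it right

# Gradient of the cross-entropy loss w.r.t. the INPUT is (p - y) * w.
grad = [(p - y) * wi for wi in w]

# FGSM step: nudge each coordinate by eps in the sign of the gradient,
# i.e. the direction that increases the loss the fastest.
eps = 0.6
sign = lambda v: 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)
x_adv = [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

print(predict_proba(x))      # above 0.5: classified as 1
print(predict_proba(x_adv))  # below 0.5: the perturbation flips the label
```

The panda/gibbon example is exactly this step applied in pixel space, with an epsilon small enough to be invisible to a human.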
Various defensive measures such as differential privacy, federated learning, and PATE (Private Aggregation of Teacher Ensembles) have evolved, but it is always going to be a game of cat and mouse between the threat actors and the good guys.
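As a flavor of what differential privacy looks like in practice, here is a toy Laplace-mechanism sketch. The dataset, query, and epsilon are all made up; the point is that a count query gets noise scaled to its sensitivity, so the answer changes very little whether or not any single individual is in the data.

```python
# Toy Laplace mechanism for a differentially private count query.
# Dataset and epsilon are invented for illustration.
import math
import random

ages = [23, 35, 41, 29, 52, 47]

def dp_count_over_40(data, epsilon):
    true_count = sum(1 for a in data if a > 40)
    sensitivity = 1  # adding/removing one person changes the count by at most 1
    # Sample Laplace(scale = sensitivity / epsilon) noise via inverse CDF.
    u = random.random() - 0.5
    scale = sensitivity / epsilon
    noise = -scale * (1.0 if u >= 0 else -1.0) * math.log(1 - 2 * abs(u))
    return true_count + noise

random.seed(0)
print(dp_count_over_40(ages, epsilon=1.0))  # true count is 3; the answer is noisy
```

Smaller epsilon means more noise and stronger privacy; systems like DP-SGD apply the same principle to gradients during training rather than to query answers.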
For adversarial techniques and an in-depth explanation, I will follow up with the FMNIST example and notebook in a future blog. This article is meant as an introduction, to make readers understand the consequences of deploying an ML system and why we have to be careful and meticulous in our approach.
There also arises an issue of fairness in algorithmic decision making when we deploy ML systems with inherent bias; in the next blog I will write about fairness and ML systems.
The research papers, articles, and blogs this article was built upon can be found here:
Here are some of my personal favorites:
1- https://arxiv.org/abs/1609.02943 — model stealing via API