In the previous article, we looked at the key challenges that emerge from an end-to-end consideration of the AI & ML ecosystem and its workflows from a ‘traditional’ cybersecurity standpoint (without going deep into the AI & ML specifics of the workflows). In this part, we begin zooming in on the AI & ML components, starting with an exploration of the interesting new assets that AI & ML bring into the picture. Becoming cognizant of these assets allows us to treat them alongside other critical information assets when we perform threat modeling and choose security techniques for the end-to-end system. With the assets identified, it becomes easier to consider the ways attackers could breach the Confidentiality, Integrity or Availability (CIA) of each asset, to assess the impact each type of breach may have on the business, and to systematically ensure that we design and implement appropriate protections against those attacks.
In the previous article, we discussed how vast amounts of sensitive business data may be involved at various points in end-to-end AI & ML workflows and how, owing to (a) a proliferation of new tools and frameworks, (b) new types and combinations of systems/sub-systems and (c) new stakeholders, we already face a broad set of security and privacy challenges. We will proceed from there and peel back the other layers to identify interesting assets ‘downstream’ of the volumes of business data used for ‘learning’.
Interesting ‘new assets’ that AI & ML introduce
At a very simple level, many ML algorithms (especially those concerned with prediction or classification) essentially work on a numerical problem that looks like:
y = w.x + b
Here ‘x’ represents the inputs (or features) and ‘y’ the corresponding outputs or outcomes as observed in past data.
So, in a home sales prediction context, ‘x’ may be the attributes of homes that influence their price (such as the built-up area, the yard size, the locality, the condition, etc.) and ‘y’ may be the prices home sales have fetched in the last few months. The task of the algorithm is to discover the optimal ‘w’ and ‘b’ that explain the past data and that can be used to make good predictions ‘y’ for previously unseen inputs ‘x’. This process of working out the ‘w’ and the ‘b’ (referred to collectively as ‘weights’ or ‘w’ hereafter) is called ‘training’ or ‘learning’.
Once the algorithm ‘learns’ the weights ‘w’, we can use them to predict what a home newly placed on the market will likely sell for. (In most real-world problems, the evaluation of ‘w’ involves computationally intense and expensive operations on very large matrices.)
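To make the idea concrete, here is a minimal sketch of ‘training’ for y = w.x + b using plain gradient descent. The data is synthetic and purely illustrative (two made-up home features with known ‘true’ weights), not from any real pipeline:

```python
import numpy as np

# Synthetic, illustrative data: 100 'homes' with two scaled features each
# (think built-up area and yard size), priced by a known rule.
rng = np.random.default_rng(0)
x = rng.random((100, 2))
true_w, true_b = np.array([3.0, 1.5]), 0.5
y = x @ true_w + true_b  # 'observed' past sale prices

# Start from arbitrary weights and iteratively reduce the squared error.
w, b = np.zeros(2), 0.0
learning_rate = 0.5
for _ in range(2000):
    pred = x @ w + b
    error = pred - y
    w -= learning_rate * (x.T @ error) / len(y)  # gradient w.r.t. w
    b -= learning_rate * error.mean()            # gradient w.r.t. b

print(np.round(w, 2), round(b, 2))  # recovers roughly [3.0, 1.5] and 0.5
```

The loop converges on the ‘w’ and ‘b’ that explain the data; in real problems the same idea runs over millions of records and very large matrices.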
In the backdrop of this really brief overview, let us look at the interesting new assets that emerge from AI & ML:
1. Engineered Features
In many ML problems, data scientists work closely with domain experts to devise the best ‘representation’ of the data for machine learning algorithms. This is called ‘feature engineering’ and can take a lot of effort and insight. Domain expertise supplies the intuition about which features might be interesting to consider (or not), and data scientists or statisticians help figure out the most appropriate ways to factor those features in. A good choice and combination of features can thus yield better results even when starting from the same training data, and that makes the artifacts of ‘feature engineering’ important assets from a data protection standpoint.
(These days, it is becoming more common for the model to ‘learn’ these features by itself — especially in larger systems. The technique is called ‘feature learning’ or ‘representation learning’ and the rationale is to feed in all available inputs and let the algorithm (internally) figure out which features matter and which don’t. When that happens, the ‘features’ remain internal to the model. That is, there is no explicit artifact called ‘features’ to worry about protecting. However, where features are hand-engineered they represent a valuable artifact that needs to be treated just like any other information asset.)
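A hypothetical illustration of hand-engineered features in the home-sales example: raw attributes are transformed into inputs a model can use more effectively. The specific transformations below are invented to show the pattern, not taken from any real pipeline:

```python
# Invented raw attributes for one home listing.
raw_home = {"built_up_sqft": 1800, "yard_sqft": 600,
            "year_built": 1995, "condition": "good"}

CURRENT_YEAR = 2024
CONDITION_SCORES = {"poor": 0, "fair": 1, "good": 2, "excellent": 3}

def engineer_features(home):
    # Domain intuition (hypothetical): total land, home age, and an
    # ordinal encoding of condition matter more than the raw fields.
    return {
        "total_sqft": home["built_up_sqft"] + home["yard_sqft"],
        "yard_ratio": home["yard_sqft"] / home["built_up_sqft"],
        "age_years": CURRENT_YEAR - home["year_built"],
        "condition_score": CONDITION_SCORES[home["condition"]],
    }

print(engineer_features(raw_home))
# e.g. total_sqft 2400, age_years 29, condition_score 2
```

This mapping from raw data to features is exactly the kind of artifact worth protecting: anyone who obtains it inherits the domain insight baked into it.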
2. Model Hyper-parameters
Most machine learning algorithms have several ‘settings’ that can be tweaked to modify the behavior of the algorithm. These settings can be thought of as ‘design choices’ that define the physical characteristics and behavior of the underlying machine learning model. For example, in the case of linear regression, the ‘learning rate’ influences how fast the model converges (or whether it converges at all) in its search for the optimal weights. In the case of deep neural networks, there are many other choices, such as the number of layers (the depth of the network), the number of neurons in each layer (the width of the layer), the batch size to use during training, the number of passes to make, the optimization method to use, and so on.
These settings are called “hyper-parameters” because their choice influences the eventual “parameters” (i.e., the coefficients or weights) that the model learns from the training data. In larger problems, dozens of such choices may be involved, and it takes considerable work to discover and settle on the combination that delivers the desired outcomes. In problem contexts where the data itself is not unique (e.g., image recognition in a scenario where millions of images are available to all parties), these hyper-parameters represent a competitive edge. In other words, once you have invested a lot of hard work to create a model that produces great results, the respective hyper-parameters are no different from any other ‘high value asset’ (HVA) for your organization, and it becomes important to think about protecting them wherever they may reside.
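The learning-rate example can be sketched directly: the same algorithm and the same (synthetic, made-up) data, trained with two different values of one hyper-parameter. The hyper-parameter is not learned from the data; it is a design choice that determines whether the model converges on good weights at all:

```python
import numpy as np

# Synthetic data for y = w.x + b with one feature (illustrative only).
rng = np.random.default_rng(1)
x = rng.random((200, 1))
y = 4.0 * x[:, 0] + 2.0

def train(learning_rate, steps=500):
    """Gradient descent; returns the final mean squared training error."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        error = w * x[:, 0] + b - y
        w -= learning_rate * (error * x[:, 0]).mean()
        b -= learning_rate * error.mean()
    return ((w * x[:, 0] + b - y) ** 2).mean()

# A well-chosen learning rate converges; a poorly chosen one barely moves.
print(train(learning_rate=0.5), train(learning_rate=0.001))
```

Discovering which of the dozens of such settings to turn, and by how much, is precisely the invested effort that makes a tuned set of hyper-parameters valuable.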
3. Weights or Coefficients
Similar to hyper-parameters, the weights/coefficients (the ‘w’ and the ‘b’ from “y = w.x + b” above) learned by the model represent all the invaluable ‘insights’ that the model has gleaned from the millions of records of data it pored over in the training phase. Future predictions from the model are a simple (and often quick) mathematical operation on the new data point using these weights.
Just like hyper-parameters, these weights are ‘reusable’. In fact, they are even more ready-made for reuse than hyper-parameters. Using a technique called ‘transfer learning’, other data scientists can start with weights exported from your model and improve a given solution further or attempt to solve a variant of the original problem. Indeed, this is a common collaboration technique among data scientists.
However, depending on the business context, it may not always be wise to share these weights (intentionally or otherwise). For instance, if you are a stock trading firm and your team has worked out an innovative ML-based trading scheme that you hope to profit from handsomely, you really don’t want a competitor to apply ‘transfer learning’ to those weights, for they may then start making similar or better trades than you under similar market conditions.
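A toy sketch of why exported weights are so valuable: warm-starting from someone else’s learned weights (‘transfer learning’ in its very simplest form) gets close to a good solution in far fewer steps than starting from scratch. All data and weights below are invented for illustration:

```python
import numpy as np

# A synthetic 'variant' of an original problem, with known target weights.
rng = np.random.default_rng(2)
x = rng.random((200, 3))
y_new = x @ np.array([2.1, -1.0, 0.6])

def train(w_init, steps):
    """Gradient descent from a given starting point; returns final MSE."""
    w = w_init.copy()
    for _ in range(steps):
        error = x @ w - y_new
        w -= 0.5 * (x.T @ error) / len(y_new)
    return ((x @ w - y_new) ** 2).mean()

# Hypothetical weights exported from (or stolen out of) a similar model.
exported_w = np.array([2.0, -0.9, 0.5])

cold_err = train(np.zeros(3), steps=20)   # training from scratch
warm_err = train(exported_w, steps=20)    # warm start from exported weights
print(warm_err < cold_err)
```

After the same small training budget, the warm start is already near the target, which is exactly why a competitor holding your weights can catch up quickly.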
4. Compute Investments on Specialized Hardware
Most large-scale/real-world problems require a lot of compute time to learn the right coefficients. Due to the extensive matrix-based computations involved, most solutions require specialized hardware, either field-programmable gate arrays (FPGAs) or graphics processing units (GPUs), to compute the weights in a time-efficient manner. These hardware technologies are used for their ability to perform computations in a highly parallel manner, which is exactly what high-dimensional matrix calculations call for.
Also, because such hardware can be prohibitively expensive to own and the technology has been evolving rapidly, most scenarios rely on cloud-hosted services to access it on a ‘pay as you use’ basis.
Most people have heard of, and marveled at, how the AlphaGo system created by Google’s DeepMind team beat the world champion of Go (a highly intuitive game where brute-force computation and depth of search are not enough). However, a detail that gets missed amid the buzz is how much compute effort went into that.
The DeepMind paper on the feat mentions a training time of 42 days and various progressions of hardware, some involving 48 Tensor Processing Units (TPUs) and others involving 168 TPUs, all hosted in the cloud.
Investments in compute, and the resulting expenses, will likely be similarly steep in many real-world problem contexts. Thus all those ‘weights’ in that ‘w’ matrix represent a lot of $$$. Moreover, you also have to be careful about access to the compute itself, since you don’t want attackers to repurpose it for their own needs (such as attempts to crack passwords from a stolen database or to bolster the compute capacity of their botnets).
5. Custom Algorithms
The fields of AI & ML are growing rapidly because there is a lot of opportunity to apply the techniques in creative ways and in new problem domains. Interestingly, existing ‘best practices’ for algorithms and models from one problem domain don’t necessarily carry over when applied as-is in an altogether different domain. That is, what produced excellent results in one problem space may not be as effective or insightful in another. Teams that start from a solution that worked in another context therefore often find themselves investing heavily in experiments and making significant modifications and improvements before they start getting interesting and exciting results. So if you are working on a novel problem, you shouldn’t be surprised if you end up with a proprietary algorithm once you have solved it effectively. And when that happens: congratulations, you have yet another invaluable asset to protect!
Securing the new assets
We have looked at the interesting new types of information assets that ML & AI solutions bring into the picture — things such as engineered features, model hyper-parameters, weights/coefficients, highly expensive compute and custom or proprietary ML algorithms.
The attacks these new assets are subject to can be separated into two groups. The first group consists of ‘traditional’ data tampering/data theft attacks of the kind we see in other data protection contexts (for example, theft of model parameters from a data scientist’s laptop or mailbox, or theft of weights from an unprotected file share). The techniques and approaches to protect assets from such ‘traditional’ attacks were already covered in the previous article. So long as these new types of assets are also included as threat targets and adequate protections are built into the different stages of a workflow (wherever the corresponding assets surface), we can consider this category of attacks covered.
The second and more interesting category represents attacks that stem from the way AI & ML algorithms internally work. We will continue our journey in the next part and explore those by venturing into the ‘runtime’ of AI & ML systems.
[While on the topic of ‘assets’, may I also mention that, in these times when AI & ML talent is so scarce relative to the demand, good data scientists and machine learning/data engineers also represent invaluable assets to any business. All the good HR practices of keeping your best people happy, challenged, excited, and motivated through apt recognition and rewards apply to them as well!]
This article originally appeared in Towards Data Science