For IT and security leaders, AI governance has become a high priority. As we build more AI models and use more AI solutions, we need to implement right-sized governance and policies that can minimize risk and keep our businesses secure.
There is no doubt that AI governance is critical. But because AI depends on data, AI governance has to start with data governance. We need data integrity, data protection, and access control capabilities for the data and AI models — as well as strategies for complying with data privacy and localization obligations.
Improved data governance will not only reduce the risks associated with AI; it will also help us maximize the impact of AI for our businesses. If your organization is investing in AI, strengthening data governance cannot wait.
____________________________________________________________________________
Why is governing the data used for AI so difficult? For the most part, organizations face the same data governance challenges they have faced for years. But using data to build AI models does create some additional complexity, such as:
Preserving data integrity
For AI models to provide the right answers, they need complete, accurate, and consistent data. Aggregating that data and maintaining its integrity throughout its lifecycle can be difficult, though.
To maintain data integrity, organizations must employ a multi-faceted approach that safeguards against data corruption, loss, breaches, and other risks. This approach should tightly control access to data and models to prevent accidental or deliberate corruption. Meanwhile, to avoid model drift, it’s also essential that the data used for training remains consistent with the data used during model deployment.
Maintaining data privacy and confidentiality
A sound data governance strategy begins with understanding where the data is. Maintaining data privacy and confidentiality is more challenging when data has left the organization. And if you contribute your data to the training of large AI models, your data is almost certainly leaving your organization.
Imagine the top 20 security companies decide to share their security operations center (SOC) data to train an external model. This model, hosted in a public cloud, could generate highly accurate and powerful insights. However, the data from all these companies would be mixed together, making it incredibly challenging to ensure that sensitive information from any single company remains fully protected.
Controlling internal access to in-house AI models
Building in-house models is much less risky than contributing data to external models. With in-house models, you can better prevent external individuals or companies from accessing your data. But you still have to control internal access to your models.
You might decide to build an internal AI model for HR, for example. The HR team might want to use an AI chatbot to answer employee questions or use AI to streamline payroll or administrative tasks. Because HR data includes very sensitive employee information, such as the amount of each employee’s compensation, you would have to be very careful to control internal access to that data and the models being trained.
Complying with data localization and sovereignty obligations
Data localization and data sovereignty laws add another challenge to AI and data governance. Large AI models are often trained in public clouds, which have the compute and storage resources needed for training. But public cloud data centers are not always available in the countries or regions where data is supposed to reside. As a result, organizations need ways to train — and run — models within specific jurisdictions.
Implementing effective data governance has been a central objective for IT and security teams for at least 20 years. The rise of AI reinforces the need for a strong data governance strategy that spans every stage of the data lifecycle. That strategy should tap into capabilities designed to maintain data integrity, prevent data loss, control access to data and models, and comply with data localization regulations.
Maintain data integrity
How do you reduce the risk that data will be altered in ways that affect models? Encrypting data and employing a Zero Trust model can help prevent unauthorized changes that put data integrity at risk. Audit logs can track where data travels, who touches it, and what changes have been made.
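As a rough illustration of the audit-log idea, the Python sketch below records who touched a dataset and what the data looked like at that moment, and makes later tampering with the log detectable. Chaining each entry to the hash of the previous one is an assumption made for this example, not a technique named in the article, and `AuditLog`, `record`, `verify`, and `fingerprint` are hypothetical names.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 digest; any change to the data changes the digest."""
    return hashlib.sha256(data).hexdigest()


def _hash_entry(body: dict) -> str:
    """Hash an entry's fields in a stable (sorted-key) JSON form."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


@dataclass
class AuditLog:
    """Hypothetical append-only log; each entry stores the previous entry's hash."""

    entries: list = field(default_factory=list)

    def record(self, actor: str, action: str, data: bytes) -> dict:
        prev_hash = self.entries[-1]["entry_hash"] if self.entries else ""
        body = {
            "actor": actor,                    # who touched the data
            "action": action,                  # what they did
            "data_sha256": fingerprint(data),  # state of the data afterwards
            "prev_hash": prev_hash,            # link to the previous entry
        }
        entry = dict(body, entry_hash=_hash_entry(body))
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Return False if any entry was edited, removed, or reordered."""
        prev_hash = ""
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "entry_hash"}
            if entry["prev_hash"] != prev_hash or entry["entry_hash"] != _hash_entry(body):
                return False
            prev_hash = entry["entry_hash"]
        return True
```

Comparing the `data_sha256` value recorded at training time with a fresh `fingerprint` of the same dataset is one simple way to confirm the data has not changed since the model was built.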
Prevent data exfiltration
Data loss prevention (DLP) capabilities are key for identifying and blocking data from leaving your organization — and from being used as input for an unsanctioned AI model.
Organizations also need tools that can prevent SaaS apps from collecting and using internal data to train the app vendors’ external models. At one of the companies where I was the CISO, an app vendor created a policy that said anything users entered into the app could be incorporated into the vendor’s large language model (LLM). I understood why the vendor would want to do that: There’s no doubt it could help them improve their product. For example, if we were using AI to respond to our internal support tickets, we would want to collect data on the top requests at our company.
Still, we wouldn’t want potentially sensitive information to leave our organization through another vendor’s app. Using a cloud access security broker (CASB) combined with an AI firewall can help prevent that type of data loss.
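To make the idea concrete, here is a minimal Python sketch of the kind of check a DLP control might run before text is sent to an AI service. It is not how any particular CASB or AI firewall works. The two patterns, the function names, and the `approved-ai.example` destination are all assumptions for illustration, and real DLP products use far richer detection than a pair of regular expressions.

```python
import re

# Hypothetical detection rules; a real deployment would use many more.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_sensitive(text: str) -> list[str]:
    """Return the sorted names of all patterns that match the text."""
    return sorted(name for name, pattern in SENSITIVE_PATTERNS.items() if pattern.search(text))


def allow_outbound(text: str, destination: str, sanctioned: set[str]) -> bool:
    """Block unsanctioned AI destinations and any text with sensitive matches."""
    if destination not in sanctioned:
        return False
    return not find_sensitive(text)


if __name__ == "__main__":
    sanctioned = {"approved-ai.example"}  # hypothetical allowlist
    print(allow_outbound("How do I reset my VPN token?", "approved-ai.example", sanctioned))
    print(find_sensitive("Employee SSN is 123-45-6789"))
```

The sketch denies by default: a destination that is not explicitly sanctioned is blocked even when the text itself looks harmless.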
Control access
Granular access control capabilities can help ensure data integrity and protect sensitive information used for in-house models. Unlike traditional VPNs, Zero Trust capabilities can help make sure only the right individuals can access particular data.
Why is granular access so important? Let’s return to the use of AI for HR: Maybe your company is using AI to streamline performance reviews and make compensation recommendations. You might want to let a manager see their own compensation information and the compensation information for their direct reports, but not anyone else’s. The right Zero Trust capabilities can give you that level of control.
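The manager rule described above can be sketched in a few lines of Python. This is only an illustration of the policy logic, assuming a simple in-memory directory; `Employee`, `can_view_compensation`, and the sample IDs are hypothetical, and a real Zero Trust deployment would also evaluate identity, device posture, and context before granting access.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Employee:
    employee_id: str
    manager_id: Optional[str]  # None for the top of the reporting chain


def can_view_compensation(viewer_id: str, target_id: str, directory: dict) -> bool:
    """Allow access to your own record and your direct reports' records only."""
    if viewer_id not in directory or target_id not in directory:
        return False  # deny by default when either identity is unknown
    if viewer_id == target_id:
        return True
    return directory[target_id].manager_id == viewer_id


if __name__ == "__main__":
    directory = {
        "ceo": Employee("ceo", None),
        "mgr": Employee("mgr", "ceo"),
        "ic1": Employee("ic1", "mgr"),
        "ic2": Employee("ic2", "ceo"),
    }
    print(can_view_compensation("mgr", "ic1", directory))  # direct report
    print(can_view_compensation("mgr", "ic2", directory))  # someone else's report
```

The same check would need to apply to whatever the AI model returns, not just to the underlying HR database, so that a chatbot cannot reveal a record the person asking could not open directly.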
Adhere to data localization rules
With the right localization controls, you can decide where data is inspected and make sure data and metadata do not leave a particular region. For example, you might have log metadata that contains user IP addresses, which are considered personal data in the EU. That metadata needs to stay in the EU. The right data localization controls would ensure it is not used to train a model in the US.
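A localization control of this kind amounts to a policy check before data is handed to a training job. The Python sketch below assumes each record carries a `residency` tag and that a policy table maps each residency to the regions where processing is permitted. The tags, the table contents, and the function names are illustrative assumptions, not a statement of what any regulation requires.

```python
# Hypothetical policy: residency tag -> regions where processing is permitted.
RESIDENCY_POLICY = {
    "eu": {"eu"},
    "us": {"us", "eu"},
}


def may_process(record: dict, processing_region: str, policy: dict = RESIDENCY_POLICY) -> bool:
    """Return True only if the record's residency tag permits this region."""
    allowed = policy.get(record.get("residency"), set())  # unknown tag: deny
    return processing_region in allowed


def select_training_records(records: list, processing_region: str) -> list:
    """Keep only the records that may be used for training in the given region."""
    return [r for r in records if may_process(r, processing_region)]


if __name__ == "__main__":
    logs = [
        {"client_ip": "192.0.2.10", "residency": "eu"},
        {"client_ip": "203.0.113.7", "residency": "us"},
        {"client_ip": "198.51.100.3"},  # untagged metadata is excluded
    ]
    print(select_training_records(logs, "us"))
```

In this sketch, the EU-tagged log entry is filtered out of a US training run, and untagged records are excluded everywhere until someone classifies them.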
We’re at a critical moment in the evolution of AI, as organizations build and run models that could reshape our world in the next 20 to 30 years. To make sure we can continue training and running AI models safely — without exposing data, reducing the accuracy of models, or jeopardizing compliance — we have to start strengthening data governance now.
For many organizations, a connectivity cloud can provide the best path for improving data governance by enabling them to regain control over data throughout its lifecycle. With a unified platform of cloud-native services, organizations can apply the data protection, security, and data localization capabilities they need while avoiding the complexity of managing multiple distinct tools.
Better control of data will enable us to build more powerful AI models and applications. We can make sure we are delivering accurate results and minimizing risks — today and in the future.
This article is part of a series on the latest trends and topics impacting today’s technology decision-makers.
Learn more about how to support the use of AI in the enterprise while maintaining security in the guide Ensuring safe AI practices: A CISO’s guide on how to create a scalable AI strategy.
Grant Bourzikas — @grantbourzikas
Chief Security Officer, Cloudflare
After reading this article you will be able to understand:
Why successful AI governance depends on strong data governance
4 key challenges to governing data used for AI
Strategies for strengthening data governance