Centre for Trustworthy Technology

Ideas Mined from Trustworthy Tech Dialogues

Privacy in the Age of AI

In Conversation

In this episode of Trustworthy Tech Dialogues, Patricia Thaine delves into the development of privacy-preserving solutions essential for scaling generative Artificial Intelligence.

Patricia is the co-founder and CEO of Private AI, established in 2019 alongside privacy and machine learning experts from the University of Toronto. Her doctoral research in computer science at the University of Toronto focuses on privacy-preserving natural language processing, machine learning, and applied cryptography. She also develops computational methods for deciphering lost languages. Among many other accolades, Private AI was named one of the World Economic Forum’s 2023 Technology Pioneers.

The Call for Privacy Solutions in AI Development

Privacy has emerged as a cornerstone of the discourse surrounding Artificial Intelligence, driving conversations across governance frameworks, industry practices, and user expectations. A range of strategies exists to bolster privacy across tasks and ecosystems through Privacy Enhancing Technologies (PETs). However, as Patricia articulates, Generative AI has introduced a wave of new uncertainty, highlighting the myriad risks that accompany its increasingly pervasive applications in society.

Large Language Models (LLMs) can memorize information during training and may later reproduce it across different tasks or applications. This capability raises significant concerns, particularly regarding the protection of personal or sensitive data. Since LLMs can retrieve and reveal data encountered during training in subsequent outputs, safeguarding personal and sensitive information has become a critical mission. Organizations must exercise vigilance to ensure these models do not compromise privacy, as inadvertent data leakage carries serious implications for both individuals and businesses. This persistent challenge has fueled a widespread call for data minimization: the principle that data collectors should use only the data essential for a specific task.
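
To make the principle concrete, here is a minimal sketch of data minimization: a record is stripped down to only the fields a given task requires before it is processed or shared. The record structure and field names are illustrative assumptions, not drawn from the conversation.

```python
# Minimal sketch of data minimization: keep only the fields a task needs.
# The field names and record structure are illustrative assumptions.

def minimize(record: dict, required_fields: set[str]) -> dict:
    """Return a copy of `record` containing only the fields the task requires."""
    return {k: v for k, v in record.items() if k in required_fields}

support_ticket = {
    "ticket_id": "T-1042",
    "message": "My order never arrived.",
    "customer_name": "Jane Doe",           # not needed to classify the complaint
    "credit_card": "4111-1111-1111-1111",  # never needed downstream
}

# A complaint-classification task only needs the ticket text.
print(minimize(support_ticket, {"ticket_id", "message"}))
# {'ticket_id': 'T-1042', 'message': 'My order never arrived.'}
```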

The Challenge of Adopting Privacy Solutions in AI Development

A global demand for privacy solutions is echoed in privacy and data protection regulations. While regulatory frameworks differ in their definitions of ‘personal’ information, they universally mandate the protection of Personally Identifiable Information (PII), meaning any data that can be used to identify an individual. Drawing on her research and industry experience, Patricia identifies two primary challenges for organizations striving to meet these regulatory standards: a lack of understanding of what constitutes PII, and an inability to pinpoint where risks lie within datasets. Furthermore, anonymizing PII at scale, across multilingual and multinational datasets, is an increasingly complex task for any organization.

While many regulatory standards have traditionally been enforced for structured data, which has a standardized format for efficient access and analysis, adhering to these requirements is particularly challenging for unstructured data such as images, videos, and documents. Unstructured data not only complicates compliance with general privacy and data protection regulations but also holds immense potential for innovative future data practices. The ability to accurately identify, anonymize, and replace PII in large datasets can unlock new insights and drive innovative applications in AI and data analysis.
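
As a rough illustration of what identifying and anonymizing PII in unstructured text involves, the sketch below redacts two easy entity types with regular expressions. Production systems typically rely on trained named-entity-recognition models covering many more entity types, languages, and file formats; the patterns here are simplifying assumptions.

```python
import re

# Minimal sketch of PII identification and redaction in unstructured text.
# The two regex patterns are illustrative assumptions covering easy cases;
# real systems use trained NER models for names, addresses, IDs, and more.

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each detected PII span with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach me at jane.doe@example.com or 416-555-0199."))
# Reach me at [EMAIL] or [PHONE].
```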

Value-Add of Synthetic PII Across Industries

Synthetic datasets are often used by researchers to build and test algorithms without compromising real information in the process. Synthetic data is artificially generated to resemble real datasets, creating a stand-in that mimics the patterns, characteristics, and relationships found in actual data. Patricia explains the rationale for generating synthetic PII, as opposed to fully synthetic datasets: it strikes a balance between preserving privacy and enabling further innovation.

Unlike fully synthetic datasets, replacing only the PII with synthetic values allows a dataset to retain maximum contextual information. This contextual information is invaluable for future AI training. For instance, one key advantage of AI is its capacity for sentiment analysis: the process of training a model to identify sentiment or tone in conversations. These insights enable a model to adapt its tone in conversational applications or to accurately detect the sentiment of a user’s request and respond appropriately. As Patricia elaborates, sentiment analysis derives its insights from the language surrounding PII, making it particularly suited to privacy-preserving approaches. Synthetic PII also adds a layer of protection often termed “hidden in plain sight,” in which real PII becomes difficult to discern amidst synthetic PII.
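
A minimal sketch of the idea, under heavy simplifying assumptions: pre-detected PII spans are swapped for synthetic values so the surrounding language, and therefore signals like sentiment, is left intact. The tiny replacement pools and hand-supplied spans stand in for a real detection model and a richer synthetic-data generator.

```python
import random

# Minimal sketch of synthetic PII replacement: detected PII spans are swapped
# for realistic fakes so the surrounding language (and its sentiment) is kept.
# The name/email pools and the pre-detected spans are illustrative assumptions;
# a production pipeline would detect spans with an NER model.

FAKE_NAMES = ["Alex Morgan", "Priya Nair", "Tomás Rivera"]
FAKE_EMAILS = ["a.morgan@example.org", "p.nair@example.org"]

def replace_pii(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace (start, end, label) spans, right to left, with synthetic values."""
    pools = {"NAME": FAKE_NAMES, "EMAIL": FAKE_EMAILS}
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + random.choice(pools[label]) + text[end:]
    return text

msg = "Jane Doe wrote: this delay is unacceptable!"
print(replace_pii(msg, [(0, 8, "NAME")]))
# e.g. "Priya Nair wrote: this delay is unacceptable!"  (sentiment intact)
```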

Privacy Enhancing Technologies (PETs) and Governance Frameworks

As stakeholders strive to balance safeguarding privacy with promoting innovation, various privacy frameworks and industry implementation plans rely on PETs to strike this delicate equilibrium. Implementing PETs at the organizational level necessitates a thorough analysis of which tools best suit the organization and how to integrate new practices seamlessly into existing operations. Although numerous data protection regulations worldwide advocate data minimization, most remain vague about how much minimization is enough. Different PETs serve distinct functions, and many solutions are still in active development. As a privacy expert, Patricia elucidates how PETs like Federated Learning can be combined to address specific privacy challenges.

Federated Learning, a sub-field of machine learning, involves sending the model to the data source or point of interaction so that it learns locally. This approach allows data to remain decentralized, avoiding aggregation for training purposes. It is particularly relevant for industries handling sensitive data that cannot be transferred to a third party for analysis, such as a hospital training a model on its own patient data.
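
As an illustration of the decentralized pattern Patricia describes, the sketch below implements federated averaging (FedAvg) for a toy linear model: each site takes a gradient step on data that never leaves it, and only the resulting model weights are averaged centrally. The model, learning rate, and synthetic data are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of federated averaging (FedAvg): sites train locally on
# private data, and only model weights travel to the server for averaging.

def local_step(weights, X, y, lr=0.1):
    """One gradient step of least-squares regression on a site's private data."""
    grad = 2 * X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

# Two "hospitals", each holding private data that is never pooled.
sites = []
for _ in range(2):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    sites.append((X, y))

weights = np.zeros(2)
for _ in range(100):                                    # communication rounds
    local = [local_step(weights, X, y) for X, y in sites]
    weights = np.mean(local, axis=0)                    # server averages weights only

print(weights)  # converges toward [2.0, -1.0] without centralizing any data
```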

Many PETs, including Federated Learning, are still evolving. While governance frameworks increasingly rely on these technologies, clear guidance on best practices is essential for successful implementation. Patricia highlights that HIPAA legislation in the United States provides a high degree of clarity regarding the criteria for an acceptable dataset. “This clarity fosters a greater ability and comfort in innovating with data, leading to a significantly higher level of confidence among organizations.”

Privacy by Design and Trust by Design

This is a pivotal moment in both regulatory and technological development, one in which we can collaboratively shape a future that balances privacy protection with AI innovation. Privacy has become a fundamental pillar of Trustworthy AI, and Generative AI has pushed privacy-by-design thinking to the forefront of the technology development process. Patricia emphasizes that meeting this moment also involves integrating “trust by design” principles into organizational implementation. Organizations are increasingly adopting trustworthy solutions throughout the technology lifecycle, while regulatory frameworks are shaping standards for appropriate datasets and downstream applications. The dialogue surrounding privacy in AI is not just a theoretical exercise; it is a call to action for industries, governments, and civil society alike to unite in building a future where innovation and privacy are not in conflict but in harmony.
