
Running RAG over regulated data — the review checklist

RAG over GDPR-regulated, HIPAA-regulated, or financial data requires controls at the data layer, the retrieval layer, and the output layer. This checklist maps the requirements by data class and the evidence an AI security review must produce.

Drel Research · 12 min read

Building a RAG system over regulated data is one of the highest-risk RAG deployment patterns an organisation can choose. The benefits are significant — making regulated data queryable through natural language dramatically reduces friction for authorised users — but the compliance obligations are substantial, and most organisations discover this after the system is already in development.

Regulated data in a RAG knowledge base is not just “data that requires extra care.” It is data whose processing is governed by specific legal obligations that apply independently of how the data is used. GDPR obligations apply to personal data in a RAG corpus whether or not the system is designed with privacy in mind. HIPAA obligations apply to protected health information held by a covered entity or business associate whether or not the RAG system was built for a healthcare use case. The regulation follows the data, not the intent.

Why regulated data is different

RAG over a knowledge base containing only internal operational documents — process guides, technical documentation, approved policies — has a manageable security surface. The documents are known, their scope is defined, and access control can be designed around a static classification.

RAG over regulated data is fundamentally different for three reasons:

The processing basis must be established before the data enters the corpus. Regulated data cannot be ingested on the assumption that the existing processing basis covers it. A lawful basis for processing personal data in an HR system is not automatically a lawful basis for making that data retrievable through an AI query interface. The new processing activity must be assessed independently.

Data subject rights apply to the knowledge base. Personal data in a RAG knowledge base is in scope for subject access requests, erasure requests, and rectification requests. If someone's personal data is in the corpus, they have the right to know it is there and, in many cases, to have it removed. Most RAG implementations do not include the capability to locate and remove a specific individual's data from the knowledge base.
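One way to make locate-and-remove tractable is to record, at ingestion time, which individuals each chunk refers to. The following is a minimal sketch under stated assumptions: `SubjectIndex`, `add_chunk`, `locate`, and `erase_subject` are hypothetical names for an in-memory stand-in, not the API of any real vector database, and subject identifiers are assumed to be tagged upstream.

```python
from collections import defaultdict


class SubjectIndex:
    """Hypothetical in-memory stand-in for a knowledge base with subject tagging."""

    def __init__(self):
        self.chunks = {}                    # chunk_id -> text
        self.by_subject = defaultdict(set)  # subject_id -> {chunk_id}

    def add_chunk(self, chunk_id, text, subject_ids):
        self.chunks[chunk_id] = text
        for subject_id in subject_ids:
            self.by_subject[subject_id].add(chunk_id)

    def locate(self, subject_id):
        """Subject access: list the chunk ids tagged with this individual."""
        return sorted(self.by_subject.get(subject_id, set()))

    def erase_subject(self, subject_id):
        """Erasure: drop every chunk tagged with this individual."""
        removed = self.locate(subject_id)
        for chunk_id in removed:
            self.chunks.pop(chunk_id, None)
            # A removed chunk must also disappear from other subjects' entries.
            for chunk_ids in self.by_subject.values():
                chunk_ids.discard(chunk_id)
        self.by_subject.pop(subject_id, None)
        return removed


index = SubjectIndex()
index.add_chunk("c1", "Performance review text", ["emp-001"])
index.add_chunk("c2", "Team incident note", ["emp-001", "emp-002"])
index.add_chunk("c3", "Travel policy", [])
removed = index.erase_subject("emp-001")
```

This sketch deletes whole chunks, including ones that also mention other people; redacting in place is an alternative. In a real deployment the same deletion would also need to reach embeddings, caches, and backups.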

Breach notification may apply to the corpus. If the knowledge base is compromised (through an access control failure, an infrastructure breach, or a retrieval exposure), the regulated data it contains can trigger data breach notification obligations. Whether notification is required turns on the data categories exposed and the risk to the individuals concerned, not on how severe the technical incident appears.

Regulated data in a RAG knowledge base does not dilute the regulation. Every obligation that applied to that data before ingestion still applies after. The RAG layer adds retrieval risk on top of the existing obligations — it does not replace them.

Controls required before RAG over regulated data

1. Data classification and ingestion gate

Every document must be classified for regulated data content before ingestion. Documents containing regulated data (personal data under GDPR, PHI under HIPAA, regulated financial data) must be flagged and routed through an elevated-approval ingestion path. Unclassified documents must not be ingested into a regulated knowledge base.
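The gate described above can be sketched as a small routing function. This is an illustration only: the class labels in `REGULATED_CLASSES`, the `Route` values, and the `Document` shape are hypothetical, and the classification itself is assumed to happen upstream.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Set


class Route(Enum):
    STANDARD = "standard"
    ELEVATED_APPROVAL = "elevated_approval"
    REJECTED = "rejected"


# Hypothetical labels; a real schema comes from the organisation's classification policy.
REGULATED_CLASSES = {"gdpr_personal", "hipaa_phi", "financial_regulated"}


@dataclass
class Document:
    doc_id: str
    # None means "never classified"; an empty set means "classified, nothing regulated found".
    classes: Optional[Set[str]] = None


def route_for_ingestion(doc: Document) -> Route:
    """Decide which ingestion path a document takes."""
    if doc.classes is None:
        # Unclassified documents never enter a regulated knowledge base.
        return Route.REJECTED
    if doc.classes & REGULATED_CLASSES:
        return Route.ELEVATED_APPROVAL
    return Route.STANDARD
```

Distinguishing "unclassified" from "classified as clean" is the point of the design: a missing label fails closed instead of defaulting to the standard path.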

2. Retrieval access controls aligned to data classification

Retrieval must enforce the access controls of the source data — not just the access controls of the RAG system. A user who is not authorised to access a specific document directly must not be able to retrieve it through the RAG query interface. Implement retrieval filters based on the user's data access rights.
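A minimal sketch of such a filter, assuming each chunk carries the access groups of its source document, copied at ingestion. The `Chunk` shape and `filter_by_entitlement` are hypothetical; in practice the same condition would usually be pushed down into the vector database query as a metadata filter so that unauthorised chunks are never fetched at all.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: FrozenSet[str]  # copied from the source system's ACL at ingestion


def filter_by_entitlement(candidates: List[Chunk], user_groups: FrozenSet[str]) -> List[Chunk]:
    """Keep only chunks whose source ACL shares at least one group with the user."""
    # Fail closed: a chunk with empty ACL metadata matches no user and is never returned.
    return [c for c in candidates if c.allowed_groups & user_groups]


candidates = [
    Chunk("hr-1", "Salary band table", frozenset({"hr"})),
    Chunk("kb-1", "VPN setup guide", frozenset({"hr", "engineering"})),
    Chunk("x-1", "Untagged import", frozenset()),
]
visible = filter_by_entitlement(candidates, frozenset({"engineering"}))
```

Here an engineering user sees only `kb-1`: the HR-only chunk and the untagged chunk are both withheld before anything reaches the prompt.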

3. Output PII and regulated-data detection

Model outputs must be validated before delivery to detect inclusion of regulated data classes that should not appear in the response. Implement PII and regulated-data pattern detection at the output layer. Log detected instances for the DPIA record.
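As a rough illustration of output-layer pattern detection, the sketch below scans a response with two regular expressions and records what it found. The patterns are deliberately simplistic assumptions (an email shape and a US SSN shape) and would miss most real regulated data; a production system would rely on a dedicated detection service. `PATTERNS` and `scan_output` are hypothetical names.

```python
import re

# Illustrative patterns only; real coverage needs far more than two regexes.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_output(text):
    """Return (redacted_text, findings); findings record type and count, never the value."""
    findings = []
    redacted = text
    for name, pattern in PATTERNS.items():
        redacted, count = pattern.subn(f"[REDACTED:{name}]", redacted)
        if count:
            findings.append({"type": name, "count": count})
    return redacted, findings


redacted, findings = scan_output("Contact jane@example.com, SSN 123-45-6789.")
```

The findings list stores only the pattern type and a count, so the record kept for the DPIA does not itself become a second copy of the regulated data.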

4. Inference data retention and deletion controls

Inference logs containing regulated data — prompts that include regulated-data content and completions that reference it — must meet the same retention and deletion requirements as the source data. Establish a log retention policy that is consistent with the applicable regulatory framework and verify deletion cascades to all subprocessors.
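A retention policy of this kind can be sketched as a purge pass over inference log entries. Everything specific here is an assumption: the retention periods in `RETENTION` are placeholders, not regulatory values, and applying the shortest period when an entry carries several data classes is one possible policy choice. Cascading the deletion to subprocessors is not shown.

```python
from datetime import datetime, timedelta, timezone

# Placeholder periods; real values come from the applicable framework and internal policy.
RETENTION = {
    "hipaa_phi": timedelta(days=30),
    "gdpr_personal": timedelta(days=14),
}
DEFAULT_RETENTION = timedelta(days=7)


def purge_expired(log_entries, now):
    """Split entries into (kept, purged) using the shortest period among each entry's classes."""
    kept, purged = [], []
    for entry in log_entries:
        periods = [RETENTION.get(c, DEFAULT_RETENTION) for c in entry["classes"]]
        limit = min(periods) if periods else DEFAULT_RETENTION
        if now - entry["created_at"] > limit:
            purged.append(entry)
        else:
            kept.append(entry)
    return kept, purged


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
entries = [
    {"id": "log-1", "classes": ["gdpr_personal"], "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": "log-2", "classes": ["hipaa_phi"], "created_at": datetime(2024, 1, 20, tzinfo=timezone.utc)},
]
kept, purged = purge_expired(entries, now)
```

In this example `log-1` is 30 days old against a 14-day limit and is purged, while `log-2` is 11 days old against a 30-day limit and is kept.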

5. DPIA completed and approved

A Data Protection Impact Assessment is required under GDPR Article 35 where processing is likely to result in a high risk to individuals, a threshold that RAG over personal data at scale will usually meet. The DPIA must document the processing purpose, the data flows, the risks identified, and the controls applied. It must be completed and approved by the DPO or a designated approver before the system goes live.

GDPR requirements

GDPR applies to personal data — any information relating to an identified or identifiable natural person. For a RAG system, the relevant GDPR obligations are:

Article 5 — Principles. Personal data must be processed lawfully, fairly, and transparently; collected for specified, explicit, and legitimate purposes and not processed in a manner incompatible with those purposes; adequate, relevant, and limited to what is necessary (data minimisation); and kept accurate and up to date. A RAG system that ingests a broad document corpus without classification or scope limiting conflicts with Article 5(1)(b) (purpose limitation) and Article 5(1)(c) (data minimisation) by design.

Article 6 — Lawful basis. Processing of personal data requires one of six lawful bases. For most enterprise RAG deployments, the relevant basis is legitimate interests (Article 6(1)(f)), which requires a balancing test demonstrating that the processing is necessary for the legitimate interest pursued and that this interest is not overridden by the data subjects' interests or fundamental rights and freedoms. The balancing test must be documented before processing begins.

Article 9 — Special category data. Health data, biometric data, racial or ethnic origin data, and several other categories require an Article 9 condition in addition to a lawful basis. If a RAG knowledge base contains special category data — and many internal document corpora do — the processing of that data must have both a lawful basis and an Article 9 condition documented before ingestion.

Article 35 — DPIA. Where processing is likely to result in a high risk to the rights and freedoms of natural persons, a Data Protection Impact Assessment is required. RAG over personal data at scale almost always triggers this obligation. The DPIA must be completed before the system goes into operation, with the DPO's advice sought as Article 35(2) requires and sign-off recorded by the DPO or a designated approver.

HIPAA considerations

HIPAA applies to Protected Health Information (PHI) — individually identifiable health information created, received, maintained, or transmitted by a covered entity or business associate. For a RAG system processing PHI, the relevant obligations are:

Minimum necessary standard. The HIPAA Privacy Rule requires that only the minimum necessary PHI is used or disclosed to accomplish the intended purpose. A RAG system that makes PHI broadly queryable — without limiting query scope to authorised users with a treatment, payment, or healthcare operations purpose — violates the minimum necessary standard.

Access controls. The HIPAA Security Rule requires technical safeguards including access controls for ePHI. In a RAG context, access controls must operate at retrieval time: each user's query must only retrieve PHI they are authorised to access for the specific purpose for which they are accessing it.

Audit controls. The Security Rule requires hardware, software, and procedural mechanisms that record and examine activity in systems that contain ePHI. Every retrieval of PHI through the RAG interface must be logged with sufficient detail to reconstruct who accessed what, when, and for what query.
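A retrieval audit entry with enough detail to reconstruct who accessed what, when, and for which query might look like the sketch below. The field names and the `audit_record` and `verify_chain` helpers are hypothetical. Chaining each entry to the hash of the previous one is an added assumption to make later edits detectable; it is not something the Security Rule prescribes.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(body):
    """Stable SHA-256 over the entry's fields (sorted keys give a deterministic encoding)."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def audit_record(user_id, purpose, query, retrieved_doc_ids, prev_hash=""):
    """Build one audit entry linked to the previous entry's hash."""
    body = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "purpose": purpose,
        "query": query,
        "retrieved_doc_ids": list(retrieved_doc_ids),
        "prev_hash": prev_hash,
    }
    body["entry_hash"] = _digest(body)
    return body


def verify_chain(entries):
    """Return True only if every entry is unmodified and linked to its predecessor."""
    prev = ""
    for entry in entries:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["prev_hash"] != prev or entry["entry_hash"] != _digest(body):
            return False
        prev = entry["entry_hash"]
    return True


first = audit_record("dr_lee", "treatment", "latest labs for patient 42", ["lab-9"])
second = audit_record("dr_lee", "treatment", "medication list", ["rx-3"], prev_hash=first["entry_hash"])
```

Because the query text can itself contain PHI, a log like this inherits the retention and protection requirements of the data it describes.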

Business Associate Agreements. If the vector database, embedding service, or LLM provider processes PHI on behalf of a covered entity, a Business Associate Agreement is required. Assess every third-party component in the RAG pipeline for BAA requirements.

Financial data

Financial data in RAG knowledge bases spans several regulatory regimes depending on the data type and the organisation's sector.

Customer financial data under regulations like PSD2, GLBA, and equivalent sector regulations requires the same access control and data minimisation controls as personal data under GDPR, with additional obligations around consent for data sharing and notification for data incidents.

Material non-public information (MNPI) — information that could affect investment decisions and is not publicly available — in a RAG knowledge base creates insider trading risk if accessible to users who should not have it. The access control at retrieval time must enforce information barriers equivalent to those in non-RAG contexts. A query interface that provides MNPI to users without the appropriate access clearance is a securities compliance violation, not just a security finding.

Audit and regulatory records in regulated financial institutions often have records-management obligations (retention, tamper-evidence, discoverability) that are not compatible with the way most RAG systems manage knowledge base content. Before ingesting regulatory records into a RAG corpus, the records management obligations for those records in their original context must be assessed.

Controls by layer

Controls for regulated-data RAG operate at three layers:

Data layer — what enters the knowledge base.

  • Pre-ingestion classification: every document is classified before ingestion, with its regulated data categories, applicable regulatory regime, and required access restrictions.
  • Minimum necessary scoping: the knowledge base is limited to documents necessary for the system's specific purpose, not the full organisational corpus.
  • Retention alignment: documents are not retained in the knowledge base beyond the retention period applicable to their regulated data content.

Retrieval layer — access control at retrieval time.

  • Purpose-bound access control: each user's retrieval scope is limited to documents they are authorised to access for the specific purpose for which they are querying the system. Regulated data categories (PHI, MNPI, special category personal data) are only retrievable by users with an explicit purpose-compatible authorisation.
  • Audit logging: every retrieval of regulated data is logged with user identity, query, and retrieved documents.
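Purpose-bound access differs from a plain ACL check in that the decision depends on why the user is querying, not only on who they are. A minimal sketch, assuming grants are issued per user, data class, and purpose, and that documents are tagged with the purposes they may serve (the `Grant` type, `may_retrieve`, and the purpose labels are all hypothetical):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    user_id: str
    data_class: str  # e.g. "phi", "mnpi"
    purpose: str     # e.g. "treatment", "billing"


def may_retrieve(grants, user_id, declared_purpose, doc_class, doc_purposes):
    """Allow retrieval only if the document is usable for the declared purpose
    and the user holds an explicit grant for that data class and purpose."""
    if declared_purpose not in doc_purposes:
        return False
    return Grant(user_id, doc_class, declared_purpose) in grants


grants = {Grant("dr_lee", "phi", "treatment")}
allowed = may_retrieve(grants, "dr_lee", "treatment", "phi", {"treatment", "billing"})
denied = may_retrieve(grants, "dr_lee", "billing", "phi", {"treatment", "billing"})
```

The same user is allowed for one purpose and denied for another against the same document, which is the behaviour a role-only check cannot express.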

Output layer — personal data in responses.

  • PII redaction for out-of-scope outputs: responses to queries outside the authorised scope have personal data redacted.
  • Citation controls: responses that include content from regulated documents include citations, allowing auditors to verify that the response grounding is within scope.

The review checklist

A security review checklist for RAG over regulated data, organised by data class:

All regulated data:

  • Processing basis documented for each regulated data category in scope.
  • Data classification schema applied at ingestion.
  • Knowledge base scope limited to documents necessary for stated purpose.
  • Retrieval-time access control enforced at vector database level.
  • Audit logging of all regulated data retrievals.
  • Data subject rights capability (locate, retrieve, delete by individual).

GDPR-regulated personal data (additional):

  • DPIA completed and signed off before operation.
  • Special category data conditions documented where applicable.
  • Data residency controls for cross-border transfers.

HIPAA-covered PHI (additional):

  • Minimum necessary standard controls documented.
  • BAAs in place for all third-party components processing PHI.
  • Security Rule technical safeguards (access control, audit controls, integrity, transmission security) implemented and documented.

Financial regulated data (additional):

  • Information barriers enforced for MNPI access at retrieval layer.
  • Records management obligations for regulatory records assessed before ingestion.

Evidence requirements

An AI security review of a RAG system over regulated data must produce evidence that covers both the security controls and the regulatory compliance obligations. The evidence set includes:

  • Processing basis documentation for each regulated data category, with legal counsel or DPO sign-off.
  • Data classification schema and ingestion-time classification procedure.
  • Retrieval access control implementation — how purpose-bound access is enforced at the vector database layer.
  • DPIA (for GDPR), with risk assessment, mitigation measures, and residual risk acceptance.
  • BAA list (for HIPAA), covering every third-party component in the pipeline.
  • Audit log configuration — what is logged, where logs are stored, how they are protected.
  • Data subject rights procedure — the documented process for locating and removing an individual's data from the knowledge base.
  • Results of access control testing — boundary tests confirming that regulated data is not retrievable by users outside the authorised scope.
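The boundary tests in the last item can be captured as a small harness that replays queries as out-of-scope users and reports any regulated document that comes back. This is a sketch under assumptions: `run_boundary_tests` and the case format are hypothetical, and `retrieve` stands in for whatever function the system under review exposes for retrieval.

```python
def run_boundary_tests(retrieve, cases):
    """Run each case and report leaks.

    retrieve(user_id, query) must return a list of document ids.
    Each case names the documents that must NOT be returned for that user.
    """
    failures = []
    for case in cases:
        returned = set(retrieve(case["user_id"], case["query"]))
        leaked = returned & set(case["forbidden_doc_ids"])
        if leaked:
            failures.append({"case": case["name"], "leaked": sorted(leaked)})
    return failures


def stub_retrieve(user_id, query):
    """Stand-in retriever for illustration: only the deal team sees the deal memo."""
    docs = ["policy-1"]
    if user_id == "deal_team_member":
        docs.append("deal-memo-7")
    return docs


cases = [
    {
        "name": "research analyst cannot reach MNPI",
        "user_id": "research_analyst",
        "query": "pending acquisitions",
        "forbidden_doc_ids": ["deal-memo-7"],
    },
]
failures = run_boundary_tests(stub_retrieve, cases)
```

An empty `failures` list, together with the case definitions, is the kind of artefact that can be attached to the evidence set; a non-empty list names exactly which documents crossed the boundary.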

See the Drel RAG security assessment hub for the regulated-data review module with data classification templates and DPIA guidance.


Review your regulated-data RAG system

Drel covers GDPR, HIPAA, and financial data requirements in RAG security assessments — producing the documentation, evidence, and disposition record required for compliance and governance sign-off.

A note on scope: Drel reviews assessed systems against documented architecture, configuration and intent. It does not ingest live telemetry from production environments. Dispositions reflect the assessed system at the time of review and the re-assessment triggers that govern when the disposition must be revisited.