Shadow Data Is Inevitable, But Security Risks Aren’t

Dec 12, 2023

Shadow data is inevitable, particularly with the shift to cloud and data democratization. The ease of creating shadow data assets and the potential for faster insights give employees every incentive to create them. While shadow data is generally not concerning, issues arise when sensitive data becomes involved. To harden your cloud data security posture and prevent, for example, confidential financial information from being stored in an unmonitored database, you need to get ahead of potential oversights. The challenge in proactive data management lies in striking a balance between security, agility, and data democratization.

Shadow Data and the Cloud: A Match Made in Heaven

Organizations love the public cloud because it makes everything easy to deploy. Rather than petitioning a centralized IT team to allocate resources for a new data initiative, smaller dev or analytics teams can spin up new cloud resources and start filling them with data.

Data democratization and business agility implicitly encourage shadow data. These principles essentially boil down to smaller teams accessing data independently — bypassing traditional gatekeepers in IT, DevOps or DBA departments. Marketing analysts might move customer data to Google BigQuery to analyze product usage patterns while support teams store a copy of the same data in Snowflake as part of a ticket NLP project. These tools can be set up in a few clicks and require little knowledge beyond SQL.

But when infrastructure is easier to deploy, it’s harder to monitor. And while infrastructure security has come a long way, it struggles to keep up with the rate at which shadow data is created in the cloud, which can easily lead to a leak of sensitive or regulated information.

Shadow data, as an ‘unknown unknown’, poses unique risks. Beyond not knowing where to find it, security teams don’t know they should look for it. And by definition, sensitive data stored in a shadow datastore isn't subject to the organization’s standard security policies and isn't monitored.

Common Shadow Data Scenarios

Shadow data can be generated as part of testing, backup, cloud migration or regular business operations. In many cases, this can actually help teams accomplish more, faster, and we wouldn’t necessarily want to discourage it. But sensitive data that is forgotten or abandoned presents an attractive opportunity for cyberattackers looking to steal data or launch ransomware attacks.

Let's talk about where to look for shadow data, based on real-life scenarios we’ve encountered in our work with customers.

Object Storage (AWS S3, Google Cloud Storage, Azure Blob) — the Biggest Culprit

Unstructured, inexpensive and typically accessible, object storage tends to be the biggest component of the organizational data estate. Even though it's the obvious suspect for hidden shadow data, building effective detection mechanisms is challenging — and the shadow data often goes unnoticed.

Consider a data scientist using Databricks to run a specific, one-off transformation to answer a business question. They then store the results in S3 for potential future analysis. If done using anonymized, non-PII data, there’s no issue. It poses a security and compliance problem, though, if they create a copy of customer credit card information.

How about a company that dumps its Redis instance, which contains PII, into an unencrypted S3 bucket? The security team is unaware of the problem — it’s just one more S3 bucket with an inconspicuous name. But a malicious actor with access to the cloud environment has no problem accessing the data.

Unmanaged Datastores: A Complete Black Box

In a world of on-demand compute, it’s impossible for security teams to monitor every new VM. Should someone use these machines to run databases, you now have data assets with contents invisible to most security solutions.

Imagine a developer trying to solve a data quality issue. They spin up a new Postgres instance and fill it with production data to run a test. Maybe they use a snapshot, maybe they use an automatically updated replica of a database containing sensitive information. On completion of the project, they should delete the database, but — common oversight — they leave the database running.

Or consider a company where employee turnover leaves an unmanaged MariaDB instance in the cloud environment. The database contains hundreds of gigabytes of data copied from production a year earlier, including thousands of electronic health records. While the database is no longer running, the data remains, dormant and ripe for attack. The CSPM platform alerts the organization to the database’s existence, but the team, unaware of the scale of sensitive data it contains, doesn’t consider it a priority.
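As a rough illustration of how such forgotten datastores might be surfaced, here is a minimal sketch that flags production copies with no owner or no recent access. The Datastore record, its field names, and the 90-day idle threshold are assumptions made for this example, not part of any particular product or cloud API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Datastore:
    # Hypothetical inventory record; field names are illustrative only.
    name: str
    owner: Optional[str]  # None when the owning employee or team is gone
    last_accessed: datetime
    copied_from_production: bool


def find_shadow_candidates(inventory, now, max_idle_days=90):
    """Return names of production copies that look orphaned or abandoned."""
    cutoff = now - timedelta(days=max_idle_days)
    flagged = []
    for store in inventory:
        orphaned = store.owner is None
        idle = store.last_accessed < cutoff
        # Only production copies are flagged here; scratch data is ignored.
        if store.copied_from_production and (orphaned or idle):
            flagged.append(store.name)
    return flagged
```

A check like this only narrows down where to look. It says nothing about how sensitive the contents are, which is the gap the MariaDB example exposes.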

Duplicates on Managed Datastores

Partitions, snapshots, staging tables and ELT jobs will often lead to duplicate and triplicate copies of data created in cloud data assets like BigQuery and Snowflake. While these tools have some built-in monitoring, the sheer number of services and copies can make monitoring close to impossible.
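One way to reason about duplicate detection is to fingerprint table contents and group tables whose fingerprints match. The sketch below assumes tables are available as in-memory lists of row tuples, and the table names are hypothetical. It only catches exact copies. Partial or transformed copies would need a different technique.

```python
import hashlib


def table_fingerprint(rows):
    """Order-independent fingerprint of a table given as an iterable of row tuples."""
    row_hashes = sorted(
        hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest()
        for row in rows
    )
    # Hash the sorted row hashes so row order does not affect the result.
    return hashlib.sha256("".join(row_hashes).encode("utf-8")).hexdigest()


def find_duplicate_tables(tables):
    """Group table names whose contents share the same fingerprint."""
    groups = {}
    for name, rows in tables.items():
        groups.setdefault(table_fingerprint(rows), []).append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

In practice a warehouse would compute such hashes inside the engine rather than pulling rows out, but the grouping logic stays the same.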

The Need for Data-Centric Security

While the above examples lean toward the egregious, forgotten database dumps and orphaned snapshots in cloud environments do happen, and all too frequently.

When data plays a major part in business processes, you’ll have more people doing more things with more data. As the organization grows, keeping track of this data becomes immensely challenging without the right solution.

If you can’t eliminate shadow data, what can you do about it? How can you ensure that it’s not creating a security liability?

Policy and posture are vital, but they’re a first step, not the last. Data security posture management (DSPM) classifies the data contained within a database or file store, whether managed or unmanaged. By scanning the actual records, DSPM can detect and prioritize shadow data anywhere. Highlighting and classifying sensitive records allows security teams to focus on the data assets that pose the largest risk, whether for security or compliance reasons.
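To make record-level classification concrete, here is a minimal sketch that counts likely payment card numbers in free-text records. It pairs a loose regular expression with the Luhn checksum to cut down on false positives. The function names and the pattern are assumptions for illustration, and a real classifier would cover many more data types and formats.

```python
import re

# Loose pattern: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def luhn_valid(number: str) -> bool:
    """Return True if a string of digits passes the Luhn checksum."""
    checksum = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0


def count_card_numbers(records):
    """Count substrings across records that look like valid card numbers."""
    hits = 0
    for record in records:
        for match in CARD_PATTERN.finditer(record):
            digits = re.sub(r"\D", "", match.group())
            if 13 <= len(digits) <= 19 and luhn_valid(digits):
                hits += 1
    return hits
```

A datastore with a high hit count can then be ranked above one that only contains anonymized data, which is the prioritization described above.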

Data detection and response (DDR) completes the picture by providing real-time monitoring of data assets, allowing security teams to quickly intervene when unwanted actions are underway. Organizations are then equipped to respond when a long-forgotten dataset is suddenly copied to S3 or a snapshot is suspiciously taken of a production database.
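The kind of rule a monitoring layer might apply can be sketched as follows. The DataEvent shape, the asset and actor names, and the rule itself are all hypothetical and chosen only for illustration. Real DDR tooling works from cloud audit logs and uses far richer signals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataEvent:
    # Hypothetical event shape; real audit logs differ by cloud provider.
    actor: str
    action: str  # e.g. "snapshot", "copy", "read"
    asset: str


# Assumed for the example: assets already classified as sensitive,
# and the identities allowed to copy or snapshot them.
SENSITIVE_ASSETS = {"prod-customers-db"}
APPROVED_ACTORS = {"backup-service"}


def evaluate(event: DataEvent) -> Optional[str]:
    """Return an alert message for a suspicious event, or None."""
    if event.asset not in SENSITIVE_ASSETS:
        return None
    if event.action in {"snapshot", "copy"} and event.actor not in APPROVED_ACTORS:
        return f"ALERT: {event.actor} performed {event.action} on {event.asset}"
    return None
```

The rule depends on knowing which assets are sensitive in the first place, which is why monitoring builds on classification rather than replacing it.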

By combining posture management, static risk detection, and dynamic monitoring, companies can gain visibility and control while supporting data-driven operations at scale.

Is Your Shadow Data a Problem?

To assess your shadow data situation, ask yourself these questions:

  1. Do you have an automated discovery tool that can notify you of new sensitive data assets in your environments and the safeguards around them?
  2. Can you protect your cloud data without hindering development or infrastructure performance?
  3. Would you be alerted in real time on suspicious actions involving your sensitive data?

Learn More

DSPM with data detection and response (DDR) offers critical capabilities previously missing in the cloud security landscape — data discovery, classification, static risk management, and continuous and dynamic monitoring of complex, multicloud environments. Learn how to secure your sensitive data in the cloud with our definitive DSPM resource. Download The Big Guide to DSPM with DDR today.