Most data security vendors advocate a methodology whose first step is discovering and classifying your sensitive data. They presume that you cannot define and enforce the right data security policies if you don’t properly discover and classify the data first. Seems like common sense, but at a practical level, this has been notoriously challenging to execute. That is because of this dirty little secret: Data discovery and classification is hard.
There are several reasons why.
First, most discovery and classification tools rely on reference data sets, regexes that match against specific patterns, schema metadata, and even machine learning to look for sensitive data fields. The problem is, data is messy. Data quality issues frustrate such automated approaches, resulting in many false positives requiring human intervention or, even worse, false negatives with sensitive data remaining undiscovered. As a result, they are effective only for applications or datasets that are extremely well-governed, with an organization-wide commitment to keep that data clean and well-curated. This is simply not the reality in many organizations.
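To see why messy data defeats pattern matching, consider a minimal sketch in Python. The regex patterns and sample values are illustrative assumptions, not any particular tool's rules, but they exhibit the same failure modes that trip up production discovery tools:

```python
import re

# Simplistic patterns of the kind discovery tools rely on. Both the
# patterns and the sample values below are illustrative assumptions.
PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$"),
}

samples = [
    "123-45-6789",    # well-formed SSN          -> true positive
    "123456789",      # same SSN, no dashes      -> false negative
    " 123-45-6789 ",  # stray whitespace         -> false negative
    "856-29-1104",    # part number, SSN format  -> false positive
]

for value in samples:
    hits = [name for name, rx in PATTERNS.items() if rx.match(value)]
    print(f"{value!r:>17} -> {hits or 'unclassified'}")
```

Every false negative here is sensitive data left unprotected; every false positive is a human review queue growing longer.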
Second, many organizations have data classification policies that are top-down and theoretical rather than bottom-up and operational, and many classification products are designed with this in mind. They guide users to start with a top-down attribute hierarchy that represents an ideal state rather than what is actually working in the trenches of the organization's numerous applications and analytics tools. As a result, the people whose participation is critical to correctly tagging and classifying data, the many data owners and data consumers spread throughout the organization, don't fully participate in the process. Their day jobs are busy enough as it is; why waste time on an overly complicated governance program that doesn't align with their day-to-day reality?
Third, classifications are often context-sensitive, and context can vary across systems and applications. For example, personally-identifying attributes that may be low-risk in isolation (such as postal code, income level, and other "quasi-identifiers") become high-risk when they appear in master records comprising everything known about a given person, and then lower-risk again when aggregated and summarized for analytics. So as data is transferred from one application or data store to another, its purpose and context, and thus its sensitivity classification, can change. Once again, human intervention is often needed to sort this out. The problem is that this kind of data sprawl is common in cloud-native data architectures, and is in fact encouraged by modern best practices for distributed data management.
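To make the context problem concrete, here is a toy scoring sketch. The attribute names, aggregation flag, and thresholds are illustrative assumptions; real classification engines are far more sophisticated, but they share the same dependence on context:

```python
# Toy heuristic: the same attribute is classified differently depending
# on what it sits next to. Attribute names, the aggregation flag, and
# the thresholds are illustrative assumptions, not a real engine's rules.
QUASI_IDENTIFIERS = {"postal_code", "birth_year", "income_band", "gender"}

def risk_level(columns: set[str], aggregated: bool = False) -> str:
    qi_count = len(columns & QUASI_IDENTIFIERS)
    if aggregated:
        return "low"   # summarized for analytics: no individual exposed
    if qi_count >= 3:
        return "high"  # combined quasi-identifiers can re-identify a person
    return "medium" if qi_count == 2 else "low"

print(risk_level({"postal_code"}))                          # low: harmless alone
print(risk_level({"postal_code", "birth_year", "gender"}))  # high: master record
print(risk_level({"postal_code", "birth_year", "gender"},
                 aggregated=True))                          # low: aggregated view
```

The same postal_code column moves between classes purely because of its surroundings, which is exactly why a static, one-time classification falls behind as data moves.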
Of course, the "state of the art" is improving over time. Machine-learning approaches that incorporate even limited human feedback will gradually get better at recognizing a given organization's data. And generative AI promises to be far more effective at identifying sensitive data and classifying it based on the context in which it appears.
Regardless, these issues conspire to make sensitive data discovery and classification difficult for most organizations. Data security solutions that depend on getting this first step right will have limited effectiveness.
Security observability, on the other hand, monitors systems and networks for security threats and vulnerabilities in real time. It uses tools such as intrusion detection systems, log analysis, user and usage monitoring, and threat intelligence to detect and respond to security incidents immediately, minimizing the potential impact of a breach.
For most data sources, observability tools can be a powerful way to get started even if the data therein has not been fully classified, surfacing potential risks and vulnerabilities much more quickly. Unusual access patterns, over-privileged or dormant accounts, and sudden spikes in query volume, for example, are all potential risks that can be acted upon even without having first classified the given dataset (one such check is sketched below).
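As one illustration, here is a minimal sketch of a classification-agnostic check: flagging accounts whose query volume spikes far above their own historical baseline, using nothing but audit-log counts. The accounts, numbers, and three-sigma threshold are illustrative assumptions, not any product's actual logic:

```python
from statistics import mean, stdev

# Daily query counts per account, pulled from a hypothetical audit log.
# The accounts, counts, and threshold below are all illustrative.
history = {
    "analyst_a": [40, 35, 50, 45, 38, 42],
    "svc_etl":   [200, 210, 190, 205, 198, 201],
    "intern_b":  [5, 3, 4, 6, 2, 480],   # sudden spike on the last day
}

for user, counts in history.items():
    baseline, today = counts[:-1], counts[-1]
    mu, sigma = mean(baseline), stdev(baseline)
    # Flag anything more than 3 standard deviations above the user's own
    # baseline; no knowledge of the data's classification is needed.
    if today > mu + 3 * sigma:
        print(f"ALERT {user}: {today} queries today vs. baseline ~{mu:.0f}")
```

A spike like intern_b's warrants investigation regardless of whether the tables being queried were ever tagged as sensitive.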
LEARN MORE about the different types of data security blind spots and how they can easily happen in any organization: https://www.trustlogix.io/blog/securing-data-in-the-cloud-security-blind-spots-will-hurt-you
LEARN MORE about how Classification and Security Observability technologies can together provide a superior data security posture: https://www.trustlogix.io/blog/data-intelligence-data-access-governance-data-centric-security
So even if your data classification efforts have not been successful, your data security program can still get off the ground. Start with a security observability solution: it can quickly neutralize risks that are classification-agnostic, and it can identify which data sources and systems are seeing the most anomalous activity.