Most data security vendors advocate a methodology whose first step is discovering and classifying your sensitive data. They presume that you cannot define and enforce the right data security policies if you don’t properly discover and classify the data first. Seems like common sense, but at a practical level, this has been notoriously challenging to execute. That is because of this dirty little secret: Data discovery and classification is hard.
Discovery and Classification Are Hard to Get Right
There are several reasons why.
First, most discovery and classification tools rely on reference data sets, regexes that match against specific patterns, schema metadata, and even machine learning to look for sensitive data fields. The problem is, data is messy. Data quality issues frustrate such automated approaches, resulting in many false positives requiring human intervention or, even worse, false negatives with sensitive data remaining undiscovered. As a result, they are effective only for applications or datasets that are extremely well-governed, with an organization-wide commitment to keep that data clean and well-curated. This is simply not the reality in many organizations.
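To make this concrete, here is a minimal sketch, in Python, of the kind of pattern matching such tools rely on. The patterns and sample values below are hypothetical, but they show how messy data yields both false positives and false negatives:

```python
import re

# A hypothetical, minimal pattern set of the kind a discovery tool might use.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

# Messy sample values: real-world data rarely matches clean patterns.
samples = [
    "123-45-6789",              # true positive: a well-formed SSN
    "123 45 6789",              # false negative: same SSN, different delimiter
    "PO# 142-21-9000",          # false positive: a purchase order number
    "jane.doe(at)example.com",  # false negative: an obfuscated email address
]

for value in samples:
    hits = [name for name, rx in PATTERNS.items() if rx.search(value)]
    print(f"{value!r:28} -> {hits or 'no match'}")
```

Every miss like these either lets sensitive data slip past the scanner or sends a human to review a non-issue, which is exactly why well-curated data is a prerequisite for this approach.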
Second, many organizations have data classification policies that are top-down and theoretical rather than bottom-up and operational. Many classification products are designed with this in mind. They guide their users to start with a top-down attribute hierarchy that represents an ideal state, rather than what is actually working in the trenches of the organization’s numerous applications and analytics tools. As a result, the people whose participation is critical to correctly tagging and classifying data – the many data owners and data consumers spread throughout the organization – don’t fully participate in the process. Their day jobs are busy enough as it is; why waste time on some overly complicated governance program that doesn’t really align with their day-to-day reality?
Third, classifications are often context-sensitive, and context can vary across systems and applications. For example, personally identifying attributes that are low-risk in isolation (such as postal code, income level, and other “quasi-identifiers”) become high-risk when they appear in master records comprising everything known about a given person, and then lower-risk again when aggregated and summarized for analytics. So, as data is transferred from one application or data store to another, its purpose and context, and thus its sensitivity classification, can change. Once again, human intervention is often needed to address this. The problem is that this kind of data sprawl is common in cloud-native data architectures, and is in fact encouraged by modern best practices for distributed data management.
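As an illustration, here is a toy sensitivity rating, in Python, that depends on which attributes co-occur and on whether the data has been aggregated. The column names and the threshold are invented for this sketch, not drawn from any particular standard:

```python
# Hypothetical attribute sets: each quasi-identifier is low-risk on its own.
QUASI_IDENTIFIERS = {"postal_code", "birth_year", "gender", "income_band"}
DIRECT_IDENTIFIERS = {"name", "email", "ssn"}

def sensitivity(columns: set, aggregated: bool = False) -> str:
    """Rate a dataset's sensitivity from its columns and its context."""
    if aggregated:
        # Summarized analytics output: individual people are no longer exposed.
        return "low"
    if columns & DIRECT_IDENTIFIERS:
        return "high"
    # Several quasi-identifiers together can re-identify a person,
    # even though each one is low-risk in isolation.
    return "high" if len(columns & QUASI_IDENTIFIERS) >= 3 else "low"

print(sensitivity({"postal_code"}))                                # low
print(sensitivity({"postal_code", "birth_year", "gender"}))        # high
print(sensitivity({"postal_code", "birth_year", "gender"}, True))  # low
```

The same columns rate differently depending on context, which is why a classification assigned once, in one system, cannot simply follow the data everywhere it goes.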
Of course, the “state of the art” is getting better over time. Machine-learned approaches that incorporate even limited human feedback will gradually get better at classifying a given organization’s data. And generative AI promises to be much more effective at identifying sensitive data and classifying it based on the context in which it appears.
Regardless, these issues conspire to make sensitive data discovery and classification difficult for most organizations. Data security solutions that depend on this first step being done right will have limited effectiveness.
Another Way to Get Started: Observing Data Access Controls and User Behaviors
Security observability, on the other hand, involves monitoring systems and networks for security threats and vulnerabilities in real time. It uses tools such as intrusion detection systems, log analysis, user and usage monitoring, and threat intelligence to identify and respond to security incidents as they happen, minimizing the potential impact of a breach.
For most data sources, observability tools are a powerful way to get started even if all the data in them has not been fully classified, surfacing potential risks and vulnerabilities much more quickly. For example (a few of these checks are sketched in code after the list):
- Identify tables that are never accessed. This “dark data” represents an unnecessary potential risk and could simply be deleted, regardless of the data therein.
- Identify role explosion, in which many roles exist with overlapping privileges. This can leave users over-privileged or, conversely, unable to access relevant data with no obvious reason why, given the sheer number of overlapping roles assigned to them. Roles can be reduced and rationalized to be more easily manageable, regardless of the data classifications being accessed.
- Identify “ghost accounts” that have not been used in a long time. Former employees, former customers and partners, or current employees who have transferred to new jobs no longer need those accounts. They can be removed, regardless of the data they accessed.
- Monitor usage behavior for unusual activity, such as a spike in data volumes or access at unusual times. This may be a bad actor at work, and action should be taken, regardless of the data in question.
- Look for evidence that certain tables or schemas are being transferred from one data store to many different targets. This “data sprawl” represents the places to look first for potential exfiltration, over-usage due to lack of access controls, and other risks, regardless of the data being accessed.
Each of these is a potential risk that can be acted upon even without having first classified the given dataset.
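Here is a brief sketch, in Python, of how a few of these checks might work. The table catalog, user directory, access log, and role definitions below are hypothetical stand-ins for what a database’s own query history and account metadata would provide:

```python
from datetime import datetime, timedelta

# Hypothetical inputs, standing in for a real data store's audit metadata.
catalog = {"orders", "customers", "legacy_exports", "tmp_backup_2019"}
users = {"alice", "bob", "carol"}
access_log = [
    {"user": "alice", "table": "orders",    "at": datetime(2024, 6, 1)},
    {"user": "alice", "table": "customers", "at": datetime(2024, 6, 2)},
    {"user": "bob",   "table": "orders",    "at": datetime(2023, 11, 5)},
]
now = datetime(2024, 6, 10)

# Dark data: cataloged tables that nothing reads -- candidates for deletion.
dark_tables = catalog - {event["table"] for event in access_log}

# Ghost accounts: users with no activity in the last 90 days.
cutoff = now - timedelta(days=90)
active = {event["user"] for event in access_log if event["at"] >= cutoff}
ghost_accounts = users - active

# Role explosion: pairs of roles whose privilege sets largely overlap.
roles = {
    "analyst":    {"orders:read", "customers:read"},
    "bi_reader":  {"orders:read", "customers:read", "products:read"},
    "support_ro": {"customers:read"},
}
for a in roles:
    for b in roles:
        if a < b:  # visit each unordered pair once
            shared = roles[a] & roles[b]
            if len(shared) / min(len(roles[a]), len(roles[b])) >= 0.8:
                print(f"overlapping roles: {a} / {b} share {sorted(shared)}")

print("dark tables:   ", sorted(dark_tables))     # legacy_exports, tmp_backup_2019
print("ghost accounts:", sorted(ghost_accounts))  # bob, carol
```

A real observability product would of course run such checks continuously against the data store’s own audit logs rather than in-memory samples, but note that none of these checks needed to know what the tables actually contain.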
LEARN MORE about the different types of data security blind spots and how they can easily happen in any organization: https://www.trustlogix.io/blog/securing-data-in-the-cloud-security-blind-spots-will-hurt-you
Observability vs. Classification: Which Is a Better Starting Point for Data Security?
These two approaches can complement each other. Data classification can inform security observability by identifying which data is most sensitive and therefore requires the most monitoring; the security observability solution can in turn use those classifications to automate policy enforcement for the specific data fields involved. Conversely, security observability can inform data classification by identifying the areas of greatest risk that require fine-grained access controls. The right starting point depends on the specific needs of your organization and the types of data you are trying to protect, but ideally you should use a combination of both.
LEARN MORE about how Classification and Security Observability technologies can together provide a superior data security posture: https://www.trustlogix.io/blog/data-intelligence-data-access-governance-data-centric-security
In conclusion: even if your data classification efforts have not been successful, that does not mean your data security program can’t get off the ground. Start with a security observability solution; it can quickly neutralize risks that are classification-agnostic, and it can help you identify which data sources and systems are seeing the most anomalous activity.