Open-Source Classifiers: Limited Tools for Modern Data Security

Written by Saahil Shenoy | Jul 11, 2024 3:17:37 PM

Data security is a process of gaining increasing visibility and control over your data. As an essential first step, you must find your data…all your data.

This is data discovery. But simply finding your data is not enough. You need to be able to understand your data: what it is, how it is used, and, most importantly, which data is sensitive. By understanding your data, you will be able to understand your security vulnerabilities and compliance risks. This then serves as the foundation for the rest of your data management efforts to control access, protect data, and reduce your risks.

Open-Source Classifiers: Free But Expensive to Make Accurate

Organizations have many options to discover and classify their data. Open-source data classifiers are one common option. Unfortunately, they are not well designed to support dependable data security efforts.

While cost-effective with no licensing fees, they require in-house expertise to configure, integrate, deploy, and tune before they can accurately understand your various data types and business uses. And, of course, the software will vary in performance and support services.

Open-source classifiers are designed to be a one-size-fits-all kind of software. They are built for the lowest common denominator of usage to make them adaptable for a wide range of industries and use cases.

But that often leaves quite a bit of customization necessary to make the classifier accurate enough for the particular data security requirements of an organization.

They are certainly handy tools for simple tasks and basic data classification needs, but they are unlikely to be accurate enough for data security, especially when dealing with less common data types that are specific to your business.
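To make the accuracy problem concrete, here is a minimal sketch of a pattern-only rule of the kind a generic classifier might ship with. The regular expression, function name, and sample strings are illustrative assumptions, not taken from any particular open-source project.

```python
import re

# Hypothetical pattern-only rule: anything shaped like NNN-NN-NNNN is
# treated as a US social security number, with no regard for context.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def naive_ssn_hits(text: str) -> list[str]:
    """Return every substring that merely looks like a US SSN."""
    return SSN_PATTERN.findall(text)


# Illustrative sample data (made up for this sketch).
samples = {
    "hr_record": "Employee SSN: 123-45-6789",
    "parts_catalog": "Part no. 555-12-0042, ships in 3 days",
}

hits = {name: naive_ssn_hits(text) for name, text in samples.items()}

for name, found in hits.items():
    print(name, found)
```

Both strings are flagged, even though the second is a part number. Removing that kind of false positive is the tuning work that falls to the in-house team.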

Overall, the manual tuning required to create really accurate classifiers that are also hyper-efficient to scale is substantial and should not be underestimated.

In fact, many DSPM vendors use open-source classifiers, which immediately limits their capabilities in the ways noted so far. Bedrock does not.

Bringing Business Context to Data Classification

The Bedrock Security Platform was created to address the critical needs of accurate, efficient, and ongoing data classification in ways that have been difficult, if not impossible, to achieve with open-source classifiers and traditional data security posture management (DSPM) tools.

Most importantly, the Bedrock AI Reasoning (AIR) Engine can understand the business context of each organization and its specific data types and information categories.

Such advanced understanding of an organization’s data makes it far easier to dependably identify sensitive data and greatly reduce the number of false positives, especially compared to open-source classifiers.

For example, if we know that an organization is based in the UK, we can adapt our machine learning and AI to apply business logic for identifying different data types that would be most appropriate for a UK-based company rather than a US company. We use this same sort of business intelligence to more accurately and precisely identify data types and information categories for organizations across a wide range of industries.

The AIR Engine enriches data classification by discerning the business context of a piece of data. For example, it doesn’t just report on the location of social security numbers (SSNs). It understands the kind of documents they are in.

An SSN in a W-2 tax form requires a different set of access policies compared to a credit application. In the former case, only human resources personnel should have access to those documents while in the latter case, only certain personnel in the finance department should be able to see those documents.
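As a rough illustration of the idea, the sketch below keys an access policy on both the data type and the kind of document it appears in. The role names, document-type labels, and the `can_access` helper are hypothetical and do not represent the Bedrock Policy Engine or its API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """A classified piece of data plus the document context it was found in."""
    data_type: str      # e.g. "ssn"
    document_type: str  # e.g. "w2_tax_form"


# Hypothetical policy table: the same data type maps to different roles
# depending on the document it appears in.
ALLOWED_ROLES = {
    ("ssn", "w2_tax_form"): {"human_resources"},
    ("ssn", "credit_application"): {"finance_credit"},
}


def can_access(role: str, finding: Finding) -> bool:
    """Allow access only if the role is listed for this exact context.

    Unknown contexts fall through to an empty set, so the default is deny.
    """
    key = (finding.data_type, finding.document_type)
    return role in ALLOWED_ROLES.get(key, set())


w2 = Finding("ssn", "w2_tax_form")
credit = Finding("ssn", "credit_application")

print(can_access("human_resources", w2))      # HR may read W-2 forms
print(can_access("human_resources", credit))  # but not credit applications
```

A classifier that reports only "SSN found" cannot drive a table like this, because the document type is the part of the key that separates the two policies.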

For less common or unique data types, organizations can use Bedrock to easily create their own categories, particularly for data associated with valuable intellectual property, such as genetic sequences or proprietary algorithms for signal processing. With Bedrock, you can efficiently discover and classify all your data to ensure a highly accurate foundation for your data security efforts.

And we’re not going to require you to keep telling us the same thing over and over. The Bedrock AIR Engine is very good at recognizing similar data sets once it learns their business context. We learn from the data the first time.

The Bedrock AIR Engine also ensures our customers' privacy. Unlike open-source tools and other commercial products, Bedrock does not move any source data from customer data stores. The AIR Engine only sends metadata back to its own platform, after finding and categorizing data within customer data stores.

Bedrock Fingerprinting: Tracking How Data Changes

Identifying original data isn’t enough. One of the biggest challenges of protecting data is recognizing that data after it has been copied and altered. Our patented “fingerprinting” technology makes it possible for us to provide this kind of data lineage tracing.

Bedrock’s data fingerprinting provides exceptional protection for an organization’s unique business information and intellectual property.

By harnessing its ability to understand the unique structure of data and what it actually means, including its business context, Bedrock learns the “fingerprint” of a piece of data, even if it is buried deep within other data sets.

For example, Bedrock is working with a life sciences company to protect its proprietary genetic sequences. With our fingerprinting technology, we can understand on a protein level what the actual individual characters mean in the context of a synthetic DNA sequence. This allows the Bedrock platform to recognize this type of data, regardless of where it might travel.

We are also working with an audio company that has developed its own algorithms for digital signal processing. With our fingerprinting technology, we can differentiate between that code and any other software code the company uses.

Going back to the example of our life sciences customer, we are able to safeguard its 10,000+ gene sequences by being able to recognize copies and derivative versions to ensure no part of those gene sequences leaks from the organization or moves into inappropriate data stores.
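For readers who want a feel for how altered copies can be recognized at all, here is a generic near-duplicate sketch using k-character shingles and Jaccard similarity. This is a textbook technique swapped in for illustration. It is not Bedrock's patented fingerprinting, it works purely on characters rather than on what the sequence means, and the sequences and the value of `k` are made up.

```python
def shingles(sequence: str, k: int = 8) -> set[str]:
    """Return the set of all overlapping k-character windows in the sequence."""
    s = sequence.upper()
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Illustrative sequences (invented for this sketch).
original = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
altered = original[:20] + "T" + original[21:]  # one character substituted
unrelated = "TTTTCCCCGGGGAAAATTTTCCCCGGGGAAAATTTTCCC"

sim_altered = jaccard(shingles(original), shingles(altered))
sim_unrelated = jaccard(shingles(original), shingles(unrelated))

print(f"altered copy:  {sim_altered:.2f}")
print(f"unrelated seq: {sim_unrelated:.2f}")
```

The altered copy still shares most of its shingles with the original, while the unrelated sequence shares few or none. A purely syntactic measure like this degrades quickly under heavier edits, which is where an understanding of the data's structure and meaning becomes important.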

Tracking data lineage can be particularly helpful for organizations that undergo mergers and acquisitions. Our ability to identify copies and altered versions of IP data can help keep the IP data of an acquired company from accidentally leaking into the data stores of other acquired companies under the broader parent organization.

Data Understanding Leads to More Effective Security Policies

By understanding the greater context for a piece of data, organizations can apply more effective access controls to ensure least-privilege access to any given data set.

This advanced, fine-grained understanding of the data context makes it far easier to orchestrate permissions, which is the ultimate goal for improving data security and regulatory compliance.

To learn more about how the Bedrock Policy Engine can help improve your data security, please read about the technology of the Bedrock platform or speak with our data security experts.
