Data governance

Data governance is usually involved in the early stages of security architecture, focusing on the data.

Data governance starts by considering data as a critical asset of the company. It's a new way of thinking/doing.

  • 🥡 Define how data is accessed/shared
  • 🗃️ Define how data is stored
  • 💰 Define how data is used/processed
  • 🐸 Define how data is managed (quality, retained, deleted...)
  • ...

And, define how 🔑 data is protected at every step.

We need to identify the needs, requirements, and risks related to the data, as defined in Risk management, to reduce risks to an acceptable level. Some risks related to data are:

  • 💰 Fines due to non-compliance with legal requirements (GDPR...)
  • 🔥 Reputation loss (data breaches usually cause distrust)
  • 💥 Invasion of privacy (leak of data that harms clients)
  • 🔫 Industrial espionage (another company accesses our private/confidential data)
  • 🏃 Lack of quality (duplicates, loss of time/efficiency, errors...)
  • ...

Data is needed by the company, so we can't "lock it" 🔐. This is a challenge of data governance: Protect to enable (Malcolm W. Harkins).


Data governance program

While the goal of the Security Program is to ensure that the organization's IT architecture is safe, including data, the goal of the data governance program is to ensure the efficient usage, management, and protection of data to drive business value 🚀.

โžก๏ธ Some categories of controls: access, privacy, quality, security...

👉 According to the DAMA framework, we start by identifying the regulations, then we define policies, then standards/directives and guides, then we evaluate the risks and set up procedures.

👉 According to ISO 27001, we first define the scope of the program, the assets, their criticality, and their value (impact/loss). Then, we identify the threats, and set up controls and monitoring.

👉 See also the DMBOK framework (Data Management Body of Knowledge) which, unlike the high-level DAMA approach, details each area of data management.


Inventory data assets

The first step is to inventory data assets, and find problems or elements needing improvement, along with their causes.

  • 🔑 Identify access control and management of accesses
  • 💰 Identify how data is used, identify patterns...
  • 🔍 Identify who is responsible, who the stakeholders are
  • 🏃 Identify data flows, lifecycle, and environment
  • 🔥 Identify a lack of quality (does it fit its purpose?)

โžก๏ธ You can ask representatives from each business unit.

Data life-cycle

It's important to understand or define how data is created, used, stored, archived, and deleted over time, to implement good controls.

  1. Planning 🗺️: use laws, directives, requirements, and risks to determine what information is needed and for which purpose.
  2. Design and Implementation ⚖️: define standards, policies, use cases, responsibilities, tests, audits..., including how and where the data will be structured and stored.
  3. Creation/Acquisition 🪓: collecting, importing, validating data...
  4. Storage and Maintenance 🗃️: store data in a maintainable, secure, and accessible location. Do, test, and secure backups...
  5. Usage 💸: in operations, in decision-making...
  6. Enhancement 💰: add/update/refine data... to make it more relevant to the business needs.

โžก๏ธ At some point, data will be destroyed, most likely before storage.

Identify environments

Data (both virtual and physical) can be found in the following environments:

  • Storage (databases, cloud, mobiles, devices, archives...) 📦: encryption, backup, access control, physical measures...
  • Transit ✈️: encryption & secure protocols (TLS), VPN, MPLS...
  • Utilization 💰: encryption, access control, monitoring...

Data valuation

Data valuation means assigning a value to data. This value is used to prioritize and classify data, and apply the appropriate controls. Typical questions to ask (see the sketch after the list):

  • 💵 How much did it cost to acquire this data?
  • ☠️ How much will it cost if we damage/lose this data?
  • 🔥 What's the impact if this data isn't available?
  • 💰 How important is it for business operations/decision-making?
  • 💎 What's the revenue generated using this data?
  • 🛡️ What's the cost to protect this data? (regulations, fines...)
  • 🥷 How much would the competitors pay for this data?
  • ⏳ Is this data new? Old? (newer data is often more valuable)
  • ...
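
As an illustration, valuation is sometimes reduced to a weighted score over such criteria. A minimal Python sketch, where the criteria names, weights, and ratings are all assumptions rather than a standard formula:

    # Hypothetical weighted scoring of a data asset (illustrative only).
    WEIGHTS = {
        "acquisition_cost": 0.15,  # cost to acquire the data
        "loss_impact": 0.30,       # cost if damaged/lost/unavailable
        "business_value": 0.25,    # operations, decisions, revenue
        "protection_cost": 0.10,   # regulations, fines...
        "competitor_value": 0.10,  # what competitors would pay
        "freshness": 0.10,         # newer data is often more valuable
    }

    def valuation_score(ratings):
        """Combine per-criterion ratings (0-10) into a single score."""
        return sum(w * ratings.get(c, 0.0) for c, w in WEIGHTS.items())

    # Example: rate a hypothetical "customer orders" dataset.
    print(valuation_score({"loss_impact": 9, "business_value": 8,
                           "freshness": 6, "acquisition_cost": 4}))  # ~5.9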

👉 The DIKW model (Data, Information, Knowledge, Wisdom) is a pyramid-shaped model that represents how "data" is transformed into "wisdom". The goal is to show how raw data can be turned into useful information that drives business value 💵.

Example: raw data (ex: 1984) is transformed into information by adding context (ex: Los Angeles summer olympics). By interpreting the information, it becomes knowledge (ex: it occurs every 4 years). And through reflection, it becomes wisdom (ex: the next one will be in 2024).


Structuring the program I

The second step is to define and structure the governance program.

  • 👑 Define the roles and responsibilities of those involved

  • 🛣️ Define a data governance framework tailored for the organization with guidelines for management, security, privacy, quality, compliance, improvements...


Monitoring and improvements

The program must be continuously monitored and improved.

โžก๏ธ See the "Plan - Correct - Monitor - React" (PCMR) framework


Data security

See the ISO 27002 standard.

Metadata management

Metadata is information about data that helps users understand how each piece of data fits into the ecosystem.

  • ๐Ÿ› the description of the data
  • ๐Ÿ”  the concept the data represent (ex: a Person, an Address...).
  • ๐Ÿƒ the relation between data and concepts
  • ๐Ÿงฌ how the data is used, stored, retained, and destroyed
  • ๐Ÿฃ the transformations (ex: derived attributes in SQL)
  • ๐Ÿ‘ฎ the validations/quality checks (ex: constraints in SQL)
  • ๐Ÿ”‘ how much is the data important/needed for the company
  • ๐Ÿ‘‘ who is the owner, the responsible/attendants
  • ๐Ÿชฆ what's the origin of the data
  • ๐Ÿฆ— who can access this data, and what they can do with it

โš ๏ธ It must be clear where are Metadata stored and what's inside.

Glossary of terms

A glossary of terms is used to describe every term (ex: Client):

  • A description of what the term is: descriptive, unambiguous, with hyperlinks to other terms (see ISO-11179)
  • Add abbreviations, acronyms, synonyms, translations...
  • Add any relation between terms
  • Add management information (When was this term added? By whom? Who approved it? ...)

Usually, we use a taxonomy to classify terms (function, product category, business unit/process...).

  • Governance terms 👑: for instance, "deprecated" or "retired" that could be used to indicate the state of semantic terms.
  • Semantic terms 💵: business terms such as "customer", "invoice", "product"... Each term is often categorized into entities and properties. For example, the "customer" entity might have properties such as "name" and "address".

👉 Entities can be divided into: things, activities, and actors. Properties can be divided into: identifiers, attributes, and conditions.
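
A glossary entry could be stored as a simple structured record; a minimal sketch where the fields follow the points above and every value is hypothetical:

    # Hypothetical glossary entry (illustrative only).
    glossary_entry = {
        "term": "Customer",
        "kind": "entity",        # entities: things, activities, actors
        "description": "A person or company that buys our products.",
        "synonyms": ["Client"],
        "properties": ["name", "address"],  # identifiers/attributes/conditions
        "related_terms": ["Invoice", "Order"],
        "state": "approved",     # governance term (vs "deprecated"...)
        "added_by": "jane.doe",  # management information
        "approved_by": "data-governance-board",
    }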

Data models

Data models are used to explain something in a standardized way. Depending on the audience, the model will have more or less detail:

  • conceptual models: high-level concepts
  • logical models: relationship between data elements
  • physical models: how the data is stored (ex: database schemas)

Structuring the program II

Data classification

Elements are in the same class if they share the same controls. A common classification of data could be:

  • Public: information that won't harm the company
  • Sensitive: information that should be closely monitored, but that won't harm the company much if disclosed (ex: upcoming news)
  • Private: information that will impact users or the company if disclosed (ex: browser history)
  • Confidential: information that will harm users or the company if disclosed (ex: credit card data)

โžก๏ธ Clearly define what information is in which level. It's usually based on the valuation. You should not have more than 7.

🔥 An organization may add a class, or a subclass within every class, called Critical data, for data essential for the organization to operate. It is usually identified when performing a risk assessment.

🧨 An organization may also add a class, or a subclass within every class, called Sensitive data, for data that may harm users or the organization if disclosed to/accessed by unauthorized individuals.

💣 Beware that, using inference, non-confidential information such as birthdate, postal code, and gender can be combined to identify someone.

An additional classification or sub-classification could be the one below. At the top, semantics are the most important, as problems there impact the quality of the elements below and heavily impact the business. The further down we go, the more data there is.

  1. ๐Ÿ“ Metadata: describe what kind of data we have
  2. ๐Ÿ—๏ธ Reference data: data used as a reference for other data, such as a list of countries, ranks for customers (iron, gold, diamond)...
  3. ๐Ÿ’ฐ Structural data: data about external entities (ex: providers, clients...) and the data related to the service/product (ex: price...)
  4. ๐Ÿข Organizational data: data about the company (ex: employees, sales inventory, departments, hierarchies...)
  5. ๐Ÿ’ต Operational data: data generated by the activity of the company (ex: orders, invoices...)
  6. ๐Ÿชจ Audit data: logs of every change of the data

We call Master data the structural, organizational, and operational data altogether. This is the core and critical data of the company, and it is considered the single source of truth.


Structuring the program III

Data quality

Data is of quality if it fits its purpose. It means that even incomplete data could pass this test, as long as it fits what the organization needs.

  1. 👉 Accurate: represents the truth (in real life)
  2. 🥡 Complete: every entity and required properties are present
  3. 🪞 Consistent: uniform between datasources
  4. 🏃 Referential integrity: elements are correctly linked
  5. ⚡ Up-to-date: promptly updated
  6. 🧬 Unique: no duplicates
  7. ✅ Valid: within the expected range...
  8. 💰 Relevant: useful for the organization
  9. 👮 Trustworthy: the source is known
  10. ✈️ Available: those who need it can access it
  11. 🔒 Protected: only those allowed can access it
  12. 🙋 Understandable: the definition is both shared and clear

Usually, data quality must be ensured when creating, storing, and using the data. It's important to investigate the cause of low-quality data 🪲, and to prioritize the problems to mitigate.

A lack of quality may result in:

  • 🦥 delays (due to duplicate, incorrect, and incomplete data)
  • 🐛 errors (due to duplicate, incorrect, and incomplete data)
  • 💳 financial losses (cost to fix, loss of clients due to delays/errors)

A few techniques to detect a lack of quality are (see the sketch after the list):

  • 🧐 detect outliers/extreme values
  • 🧞 ask the ones using the data
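
A minimal Python sketch of automated checks for two common symptoms, duplicates and outliers; the z-score threshold is an assumption to tune per dataset:

    # Hypothetical data-quality checks (illustrative only).
    from statistics import mean, stdev

    def find_duplicates(records, key):
        """Return the values of `key` appearing more than once."""
        seen, dups = set(), set()
        for record in records:
            value = record[key]
            if value in seen:
                dups.add(value)
            seen.add(value)
        return dups

    def find_outliers(values, z=2.0):
        """Return values further than `z` standard deviations from the mean."""
        m, s = mean(values), stdev(values)
        return [v for v in values if s and abs(v - m) > z * s]

    clients = [{"email": "a@x.com"}, {"email": "b@x.com"}, {"email": "a@x.com"}]
    print(find_duplicates(clients, "email"))         # {'a@x.com'}
    print(find_outliers([10, 12, 11, 13, 11, 250]))  # [250]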

You may have to design validation/verification processes, and will have to design processes to monitor data quality over time.

โžก๏ธ Data cleansing, deduplication, normalization...


Structuring the program IV

Data Anonymization Policy

Anonymization may be performed in many scenarios:

  • 📬 when sharing information with partners/third-party vendors
  • 📚 when conducting research (statistics...)
  • 📈 when conducting a market analysis (trends...)
  • 📝 to comply with regulations
  • 🏗️ to create a test dataset
  • 🌍 to generate and share open data
  • ...
  • ...

Most personally identifiable information (PII), such as first name, last name, address, phone number, official documents/identifiers, birthdate, gender, race..., is anonymized.

➡️ Basically, any sensitive data that can be used to identify someone.

There are many techniques that are usually combined:

  • ๐ŸŒ Generalization: group value in groups (ex: age ranges)
  • ๐Ÿšฝ Suppression: remove a record (line) or an attribute (column)
  • ๐ŸŽฌ Data swapping: swap some data between records
  • ๐Ÿซฅ Data masking: partially obscuring data (ex: 1976-XX-XX)
  • ๐Ÿค Data minimization: reduce as possible the dataset length
  • ๐Ÿฅ‚ Caviar: replace data with a "caviar" term (ex: John Doe)
  • ๐ŸŽญ Pseudo-anonymization: replace a PII with a random string that usually matches the format, but it's the real one
  • ๐Ÿ“ Adjustment: apply a modification to group values (round off...)
  • ๐Ÿ–จ๏ธ Aggregation: provide computed data (mean of salaries...)
  • ๐Ÿ”‘ Hashing: hash emails, passwords, IPs, ids...

There are also: Noise addition 🔊, Perturbation 🥞, and Differential privacy 👣, which add noise to the data in different ways, to prevent any form of identification.
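
As an illustration, a few of these techniques can be combined in a small pipeline. A minimal Python sketch, where the field choices, formats, and salt are assumptions:

    # Hypothetical anonymization pipeline (illustrative only).
    import hashlib

    def generalize_age(age):
        """Generalization: replace an exact age with a 10-year range."""
        low = (age // 10) * 10
        return f"{low}-{low + 9}"

    def mask_birthdate(date):
        """Data masking: keep only the year (ex: 1976-XX-XX)."""
        return date[:4] + "-XX-XX"

    def pseudonymize(value, salt="app-secret"):
        """Hashing-based pseudonymization: same input -> same token.
        Beware: plain hashes of low-entropy data can be brute-forced."""
        return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

    record = {"name": "Jane Doe", "birthdate": "1976-04-02", "age": 47}
    anonymized = {
        "name": pseudonymize(record["name"]),              # pseudonymization
        "birthdate": mask_birthdate(record["birthdate"]),  # masking
        "age": generalize_age(record["age"]),              # generalization
    }
    print(anonymized)  # {'name': '...', 'birthdate': '1976-XX-XX', 'age': '40-49'}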


Structuring the program V

Data Retention Policy

Each piece of data should have a retention policy describing:

  • 🔁 how long it is retained
  • 🏦 where (location, based on classification?, chronologically?)
  • 📚 how (format, media)
  • 👷 who will archive/manage it
  • 🧑‍🏭 who will ensure it's destroyed
  • ...

These choices are mainly based on legal requirements.
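
Such a policy can be captured per data category; a sketch where every value is hypothetical:

    # Hypothetical retention policy entries (illustrative only).
    RETENTION_POLICIES = {
        "invoices": {
            "retention": "10 years",   # assumed legal requirement
            "location": "encrypted archive, EU region",
            "format": "PDF/A",
            "archived_by": "finance team",
            "destroyed_by": "IT operations",
        },
        "web_logs": {
            "retention": "13 months",
            "location": "cold storage",
            "format": "compressed JSON",
            "archived_by": "platform team",
            "destroyed_by": "platform team",
        },
    }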

Data Destruction Policy

The destruction policy describes:

  • 📝 what data should be destroyed (both physical and virtual)
  • 🐣 when data should be destroyed (ex: frequency)
  • 💥 how data should be destroyed (based on classification...)
  • 🧑‍🏭 who will destroy it

It's important to ensure that data is erased beyond recovery 🔐. For instance, emptying the recycle bin is NOT enough. The disk should be securely wiped (overwritten) or physically destroyed. The same applies to databases...
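
As a naive illustration, a file can be overwritten before deletion; a Python sketch. Note that on SSDs, journaling/copy-on-write filesystems, or with backups/snapshots, overwriting does not guarantee erasure, so dedicated wiping tools, encryption with key destruction, or physical destruction are preferred:

    # Naive overwrite-then-delete sketch (illustrative only; see the
    # caveats above, this is NOT a guaranteed secure erase).
    import os

    def overwrite_and_delete(path, passes=3):
        """Overwrite a file with random bytes before unlinking it."""
        size = os.path.getsize(path)
        with open(path, "r+b") as f:
            for _ in range(passes):
                f.seek(0)
                f.write(os.urandom(size))
                f.flush()
                os.fsync(f.fileno())
        os.remove(path)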


Data management

One of the goals of data governance is to ensure efficient data management, by defining data quality policies, metadata management policies... There are five levels of data management:

  1. Ad hoc 🐒: individuals manage data in their own way; there is no structure nor consistency.
  2. Repeatable 📝: departments establish basic policies/procedures such as formats, backup policies...
  3. Defined 🧐: the organization starts formalizing data management; this includes data quality policies, roles, responsibilities...
  4. Managed ⚖️: data is actively monitored and inventoried
  5. Optimized 💰💵: data management is implicit in all business processes, and continuously improved

The goal is to move from the lower levels to the higher ones, where data is managed effectively and drives business value.

👉 Gartner released a research report that provides guidelines for improving data management, called "Generally Accepted Information Principles for Improved Information Asset Management".

➡️ Rules and standards are usually defined in these documents:

  • Principles: high-level guidance (why do we need this data?)
  • Policies: more specific rules (what do we need?)
  • Guidelines: recommendations and best practices (how?)
  • Standards: technical specifications (quality...)

👻 To-do 👻

Stuff that I found, but never read/used yet.

  • Retention policy, a business continuity plan (step 4 of data life-cycle)
  • Data Stewardship
  • Test data
  • Information architecture triangle
  • minimize data redundancy to ensure that data is accurate, consistent, and secure
  • misused or mishandled

Data breaches

  • The company must find, and patch the vulnerability
  • The company must inform the clients
  • The company will have to pay fines
  • The company will lose reputation

To prevent data breaches, the company, after identifying, classifying, and prioritizing data, should determine why the data is likely to be targeted, along with the associated risks, and define the required protections.

  • data recovery plans
  • Data Loss Prevention (DLP)
  • communication plans
  • data breach notification laws