Data governance

Data governance is usually involved in the early stages of security architecture, focusing on the data.

Data governance starts by considering data as a critical asset of the company. It's a new way of thinking/doing.

  • 🥡 Define how data is accessed/shared
  • 🗃️ Define how data is stored
  • 💰 Define how data is used/processed
  • 🐸 Define how data is managed (quality, retained, deleted...)
  • ...

And, define how 🔑 data is protected at every step.

We need to identify the needs, requirements, and risks related to the data, as defined in Risk management, to reduce risks to an acceptable level. Some risks related to data are:

  • 💰 Fines due to non-compliance with legal requirements (GDPR...)
  • 🔥 Reputation loss (data breaches usually cause distrust)
  • 💥 Invasion of privacy (leak of data that harms clients)
  • 🔫 Industrial espionage (another company accesses our private/confidential data)
  • 🏃 Lack of quality (duplicates, loss of time/efficiency, errors...)
  • ...

Data is needed by the company, so we can't "lock it" 🔐. This is a challenge of data governance: Protect to enable (Malcolm W. Harkins).


Data governance program

While the goal of the Security Program is to ensure that the organization's IT architecture is safe, including data, the goal of the data governance program is to ensure the efficient usage, management, and protection of data to drive business value 🚀.

โžก๏ธ Some categories of controls: access, privacy, quality, security...

👉 According to the DAMA framework, we start by identifying the regulations, then we define policies, then standards/directives and guides, then we evaluate the risks and set up procedures.

👉 According to ISO 27001, we first define the scope of the program, the assets, their criticality, and their value (impact/loss). Then, we identify the threats, and set up controls and monitoring.

👉 See also the DMBOK framework (Data Management Body of Knowledge) which, unlike the high-level DAMA approach, details each area of data management.


Inventory data assets

The first step is to inventory data assets, and find problems or elements needing improvement, along with their causes.

  • 🔑 Identify access control and management of accesses
  • 💰 Identify how data is used, identify patterns...
  • 🔍 Identify who is responsible, who the stakeholders are
  • 🏃 Identify data flows, lifecycle, and environment
  • 🔥 Identify a lack of quality (does it fit its purpose?)

โžก๏ธ You can ask representatives from each business unit.

Data life-cycle

It's important to understand or define how data is created, used, stored, archived, and deleted over time, to implement good controls.

  1. Planning 🗺️: use laws, directives, requirements, and risks to determine what information is needed and for which purpose.
  2. Design and Implementation ⚖️: define standards, policies, use cases, responsibilities, tests, audits..., including how and where the data will be structured and stored.
  3. Creation/Acquisition 🪓: collecting, importing, validating data...
  4. Storage and Maintenance 🗃️: store data in a maintainable, secure, and accessible location. Do, test, and secure backups...
  5. Usage 💸: in operations, in decision-making...
  6. Enhancement 💰: add/update/refine data... to make it more relevant to the business needs.

โžก๏ธ At some point, data will be destroyed, most likely before storage.

Identify environments

Data (both virtual and physical) can be found in the following environments:

  • Storage (databases, cloud, mobiles, devices, archives...) 📦: encryption, backup, access control, physical measures...
  • Transit ✈️: encryption & secure protocols (TLS), VPN, MPLS...
  • Utilization 💰: encryption, access control, monitoring...

Data valuation

Data valuation means assigning a value to data. This value is used to prioritize and classify data, and apply the appropriate controls. Typical questions to ask (see the sketch after the list):

  • 💵 How much did it cost to acquire this data?
  • ☠️ How much will it cost if we damage/lose this data?
  • 🔥 What's the impact if this data isn't available?
  • 💰 How important is it for business operations/decision-making?
  • 💎 What's the revenue generated using this data?
  • 🛡️ What's the cost to protect this data? (regulations, fines...)
  • 🥷 How much would the competitors pay for this data?
  • ⏳ Is this data new? Old? (newer data is often more valuable)
  • ...
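
As an illustration, valuation is sometimes reduced to a weighted score over such criteria. A minimal Python sketch, where the criteria names, weights, and ratings are all assumptions rather than a standard formula:

    # Hypothetical weighted scoring of a data asset (illustrative only).
    WEIGHTS = {
        "acquisition_cost": 0.15,  # cost to acquire the data
        "loss_impact": 0.30,       # cost if damaged/lost/unavailable
        "business_value": 0.25,    # operations, decisions, revenue
        "protection_cost": 0.10,   # regulations, fines...
        "competitor_value": 0.10,  # what competitors would pay
        "freshness": 0.10,         # newer data is often more valuable
    }

    def valuation_score(ratings):
        """Combine per-criterion ratings (0-10) into a single score."""
        return sum(w * ratings.get(c, 0.0) for c, w in WEIGHTS.items())

    # Example: rate a hypothetical "customer orders" dataset.
    print(valuation_score({"loss_impact": 9, "business_value": 8,
                           "freshness": 6, "acquisition_cost": 4}))  # ~5.9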

👉 The DIKW model (Data, Information, Knowledge, Wisdom) is a pyramid-shaped model that represents how "data" is transformed into "wisdom". The goal is to show how raw data can be turned into useful information that drives business value 💵.

Example: raw data (ex: 1984) is transformed into information by adding context (ex: Los Angeles summer olympics). By interpreting the information, it becomes knowledge (ex: it occurs every 4 years). And through reflection, it becomes wisdom (ex: the next one will be in 2024).


Structuring the program I

The second step is to define and structure the governance program.

  • 👑 Define the roles and responsibilities of those involved

  • 🛣️ Define a data governance framework tailored for the organization with guidelines for management, security, privacy, quality, compliance, improvements...


Monitoring and improvements

The program must be continuously monitored and improved.

โžก๏ธ See the "Plan - Correct - Monitor - React" (PCMR) framework


Data security

See the ISO 27002 standard.

Metadata management

Metadata is information about data that helps users understand how each piece of data fits into the ecosystem.

  • ๐Ÿ› the description of the data
  • ๐Ÿ”  the concept the data represent (ex: a Person, an Address...).
  • ๐Ÿƒ the relation between data and concepts
  • ๐Ÿงฌ how the data is used, stored, retained, and destroyed
  • ๐Ÿฃ the transformations (ex: derived attributes in SQL)
  • ๐Ÿ‘ฎ the validations/quality checks (ex: constraints in SQL)
  • ๐Ÿ”‘ how much is the data important/needed for the company
  • ๐Ÿ‘‘ who is the owner, the responsible/attendants
  • ๐Ÿชฆ what's the origin of the data
  • ๐Ÿฆ— who can access this data, and what they can do with it

โš ๏ธ It must be clear where are Metadata stored and what's inside.

Glossary of terms

A glossary of terms is used to describe every term (ex: Client):

  • A description of what the term is: descriptive, unambiguous, with hyperlinks to other terms (see ISO-11179)
  • Add abbreviations, acronyms, synonyms, translations...
  • Add any relation between terms
  • Add management information (When was this term added? By whom? Who approved it? ...)

Usually, we use a taxonomy to classify terms (function, product category, business unit/process...).

  • Governance terms 👑: for instance, "deprecated" or "retired" that could be used to indicate the state of semantic terms.
  • Semantic terms 💵: business terms such as "customer", "invoice", "product"... Each term is often categorized into entities and properties. For example, the "customer" entity might have properties such as "name" and "address".

👉 Entities can be divided into: things, activities, and actors. Properties can be divided into: identifiers, attributes, and conditions.
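
A glossary entry could be stored as a simple structured record; a minimal sketch where the fields follow the points above and every value is hypothetical:

    # Hypothetical glossary entry (illustrative only).
    glossary_entry = {
        "term": "Customer",
        "kind": "entity",        # entities: things, activities, actors
        "description": "A person or company that buys our products.",
        "synonyms": ["Client"],
        "properties": ["name", "address"],  # identifiers/attributes/conditions
        "related_terms": ["Invoice", "Order"],
        "state": "approved",     # governance term (vs "deprecated"...)
        "added_by": "jane.doe",  # management information
        "approved_by": "data-governance-board",
    }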

Data models

Data models are used to explain something in a standardized way. Depending on the audience, the model will have more or less detail:

  • conceptual models: high-level concepts
  • logical models: relationship between data elements
  • physical models: how the data is stored (ex: database schemas)

Structuring the program II

Data classification

Elements are in the same class if they share the same controls. A common classification of data could be:

  • Public: information that won't harm the company
  • Sensitive: information that should be closely monitored, but that won't harm the company much if disclosed (ex: upcoming news)
  • Private: information that will impact users or the company if disclosed (ex: browser history)
  • Confidential: information that will harm users or the company if disclosed (ex: credit card data)

โžก๏ธ Clearly define what information is in which level. It's usually based on the valuation. You should not have more than 7.

🔥 An organization may add a class, or a subclass within every class, called Critical data, for data essential for the organization to operate. It is usually identified when performing a risk assessment.

🧨 An organization may also add a class, or a subclass within every class, called Sensitive data, for data that may harm users or the organization if disclosed to/accessed by unauthorized individuals.

💣 Beware that, using inference, non-confidential information such as birthdate, postal code, and gender can be combined to identify someone.

An additional classification or sub-classification could be the one below. At the top, semantics are the most important, as problems there impact the quality of the elements below and heavily impact the business. The further down we go, the more data there is.

  1. ๐Ÿ“ Metadata: describe what kind of data we have
  2. ๐Ÿ—๏ธ Reference data: data used as a reference for other data, such as a list of countries, ranks for customers (iron, gold, diamond)...
  3. ๐Ÿ’ฐ Structural data: data about external entities (ex: providers, clients...) and the data related to the service/product (ex: price...)
  4. ๐Ÿข Organizational data: data about the company (ex: employees, sales inventory, departments, hierarchies...)
  5. ๐Ÿ’ต Operational data: data generated by the activity of the company (ex: orders, invoices...)
  6. ๐Ÿชจ Audit data: logs of every change of the data

We call Master data the structural, organizational, and operational data altogether. This is the core and critical data of the company, and it is considered the single source of truth.


Structuring the program III

Data quality

Data is of quality if it fits its purpose. It means that even incomplete data could pass this test, as long as it fits what the organization needs.

  1. 👉 Accurate: represents the truth (in real life)
  2. 🥡 Complete: every entity and required properties are present
  3. 🪞 Consistent: uniform between datasources
  4. 🏃 Referential integrity: elements are correctly linked
  5. ⚡ Up-to-date: promptly updated
  6. 🧬 Unique: no duplicates
  7. ✅ Valid: within the expected range...
  8. 💰 Relevant: useful for the organization
  9. 👮 Trustworthy: the source is known
  10. ✈️ Available: those who need it can access it
  11. 🔒 Protected: only those allowed can access it
  12. 🙋 Understandable: the definition is both shared and clear

Usually, data quality must be ensured when creating, storing, and using the data. It's important to investigate the cause of low-quality data 🪲, and to prioritize the problems to mitigate.

A lack of quality may result in:

  • 🦥 delays (due to duplicate, incorrect, and incomplete data)
  • 🐛 errors (due to duplicate, incorrect, and incomplete data)
  • 💳 financial losses (cost to fix, loss of clients due to delays/errors)

A few techniques to detect a lack of quality are (see the sketch after the list):

  • 🧐 detect outliers/extreme values
  • 🧞 ask the ones using the data
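
A minimal Python sketch of automated checks for two common symptoms, duplicates and outliers; the z-score threshold is an assumption to tune per dataset:

    # Hypothetical data-quality checks (illustrative only).
    from statistics import mean, stdev

    def find_duplicates(records, key):
        """Return the values of `key` appearing more than once."""
        seen, dups = set(), set()
        for record in records:
            value = record[key]
            if value in seen:
                dups.add(value)
            seen.add(value)
        return dups

    def find_outliers(values, z=2.0):
        """Return values further than `z` standard deviations from the mean."""
        m, s = mean(values), stdev(values)
        return [v for v in values if s and abs(v - m) > z * s]

    clients = [{"email": "a@x.com"}, {"email": "b@x.com"}, {"email": "a@x.com"}]
    print(find_duplicates(clients, "email"))         # {'a@x.com'}
    print(find_outliers([10, 12, 11, 13, 11, 250]))  # [250]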

You may have to design validation/verification processes, and will have to design processes to monitor data quality over time.

โžก๏ธ Data cleansing, deduplication, normalization...


Structuring the program IV

Data Anonymization Policy

Anonymization may be performed in many scenarios:

  • 📬 when sharing information with partners/third-party vendors
  • 📚 when conducting research (statistics...)
  • 📈 when conducting a market analysis (trends...)
  • 📝 to comply with regulations
  • 🏗️ to create a test dataset
  • 🌍 to generate and share open data
  • ...
  • ...

Most personally identifiable information (PII), such as first name, last name, address, phone number, official documents/identifiers, birthdate, gender, race..., is anonymized.

➡️ Basically, any sensitive data that can be used to identify someone.

There are many techniques that are usually combined:

  • ๐ŸŒ Generalization: group value in groups (ex: age ranges)
  • ๐Ÿšฝ Suppression: remove a record (line) or an attribute (column)
  • ๐ŸŽฌ Data swapping: swap some data between records
  • ๐Ÿซฅ Data masking: partially obscuring data (ex: 1976-XX-XX)
  • ๐Ÿค Data minimization: reduce as possible the dataset length
  • ๐Ÿฅ‚ Caviar: replace data with a "caviar" term (ex: John Doe)
  • ๐ŸŽญ Pseudo-anonymization: replace a PII with a random string that usually matches the format, but it's the real one
  • ๐Ÿ“ Adjustment: apply a modification to group values (round off...)
  • ๐Ÿ–จ๏ธ Aggregation: provide computed data (mean of salaries...)
  • ๐Ÿ”‘ Hashing: hash emails, passwords, IPs, ids...

There are also: Noise addition 🔊, Perturbation 🥞, and Differential privacy 👣, which add noise to the data in different ways, to prevent any form of identification.
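
As an illustration, a few of these techniques can be combined in a small pipeline. A minimal Python sketch, where the field choices, formats, and salt are assumptions:

    # Hypothetical anonymization pipeline (illustrative only).
    import hashlib

    def generalize_age(age):
        """Generalization: replace an exact age with a 10-year range."""
        low = (age // 10) * 10
        return f"{low}-{low + 9}"

    def mask_birthdate(date):
        """Data masking: keep only the year (ex: 1976-XX-XX)."""
        return date[:4] + "-XX-XX"

    def pseudonymize(value, salt="app-secret"):
        """Hashing-based pseudonymization: same input -> same token.
        Beware: plain hashes of low-entropy data can be brute-forced."""
        return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

    record = {"name": "Jane Doe", "birthdate": "1976-04-02", "age": 47}
    anonymized = {
        "name": pseudonymize(record["name"]),              # pseudonymization
        "birthdate": mask_birthdate(record["birthdate"]),  # masking
        "age": generalize_age(record["age"]),              # generalization
    }
    print(anonymized)  # {'name': '...', 'birthdate': '1976-XX-XX', 'age': '40-49'}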


Structuring the program V

Data Retention Policy

Each piece of data should have a retention policy describing:

  • 🔁 how long it is retained
  • 🏦 where (location, based on classification?, chronologically?)
  • 📚 how (format, media)
  • 👷 who will archive/manage it
  • 🧑‍🏭 who will ensure it's destroyed
  • ...

These choices are mainly based on legal requirements.
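
Such a policy can be captured per data category; a sketch where every value is hypothetical:

    # Hypothetical retention policy entries (illustrative only).
    RETENTION_POLICIES = {
        "invoices": {
            "retention": "10 years",   # assumed legal requirement
            "location": "encrypted archive, EU region",
            "format": "PDF/A",
            "archived_by": "finance team",
            "destroyed_by": "IT operations",
        },
        "web_logs": {
            "retention": "13 months",
            "location": "cold storage",
            "format": "compressed JSON",
            "archived_by": "platform team",
            "destroyed_by": "platform team",
        },
    }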

Data Destruction Policy

The destruction policy describes:

  • 📝 what data should be destroyed (both physical and virtual)
  • 🐣 when data should be destroyed (ex: frequency)
  • 💥 how data should be destroyed (based on classification...)
  • 🧑‍🏭 who will destroy it

It's important to ensure that data is erased beyond recovery 🔐. For instance, emptying the recycle bin is NOT enough. The disk should be securely wiped (overwritten) or physically destroyed. The same applies to databases...
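
As a naive illustration, a file can be overwritten before deletion; a Python sketch. Note that on SSDs, journaling/copy-on-write filesystems, or with backups/snapshots, overwriting does not guarantee erasure, so dedicated wiping tools, encryption with key destruction, or physical destruction are preferred:

    # Naive overwrite-then-delete sketch (illustrative only; see the
    # caveats above, this is NOT a guaranteed secure erase).
    import os

    def overwrite_and_delete(path, passes=3):
        """Overwrite a file with random bytes before unlinking it."""
        size = os.path.getsize(path)
        with open(path, "r+b") as f:
            for _ in range(passes):
                f.seek(0)
                f.write(os.urandom(size))
                f.flush()
                os.fsync(f.fileno())
        os.remove(path)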


Data management

One of the goals of data governance is to ensure efficient data management, by defining data quality policies, metadata management policies... There are five levels of data management:

  1. Ad hoc 🐒: individuals manage data in their own way; there is no structure nor consistency.
  2. Repeatable 📝: departments establish basic policies/procedures such as formats, backup policies...
  3. Defined 🧐: the organization starts formalizing data management; this includes data quality policies, roles, responsibilities...
  4. Managed ⚖️: data is actively monitored and inventoried
  5. Optimized 💰💵: data management is implicit in all business processes, and continuously improved

The goal is to move from the lower levels to the higher ones, where data is managed effectively and drives business value.

👉 Gartner released a research report that provides guidelines for improving data management, called "Generally Accepted Information Principles for Improved Information Asset Management".

➡️ Rules and standards are usually defined in these documents:

  • Principles: high-level guidance (why do we need this data?)
  • Policies: more specific rules (what do we need?)
  • Guidelines: recommendations and best practices (how?)
  • Standards: technical specifications (quality...)

👻 To-do 👻

Stuff that I found, but never read/used yet.

  • Retention policy, a business continuity plan (step 4 of data life-cycle)
  • Data Stewardship
  • Test data
  • Information architecture triangle
  • minimize data redundancy to ensure that data is accurate, consistent, and secure
  • misused or mishandled

Data breaches

  • The company must find, and patch the vulnerability
  • The company must inform the clients
  • The company will have to pay fines
  • The company will lose reputation

To prevent data breaches, the company, after identifying, classifying, and prioritizing data, should determine why the data is likely to be targeted, along with the associated risks, and define the required protections.

  • data recovery plans
  • Data Loss Prevention (DLP)
  • communication plans
  • data breach notification laws