## **1. Data Generation / Sources**
Where data originates: applications, devices, logs, APIs, and more.
- **Data Sources** – Refers to the origin points of data like databases, REST APIs, file systems, or streaming platforms.
- **CDC (Change Data Capture)** – Captures inserts, updates, and deletes in source systems in near real time (see the polling sketch after this list).
- **OLTP (Online Transaction Processing)** – Handles real-time transactions (e.g., banking, e-commerce).
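To make CDC concrete, here is a minimal polling-based sketch: it keeps a high-water-mark checkpoint and re-selects rows whose `updated_at` is newer. The `orders` table and its columns are hypothetical; production tools such as Debezium read the database transaction log instead of polling.

```python
import sqlite3

# Minimal polling-based CDC: repeatedly select rows whose updated_at is
# newer than the last checkpoint. The `orders` table is hypothetical;
# log-based CDC tools read the transaction log instead of polling.

def poll_changes(conn: sqlite3.Connection, last_seen: str) -> list[tuple]:
    """Return rows changed since the last checkpoint, oldest first."""
    cur = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    )
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, updated_at TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'shipped', '2024-01-02T10:00:00')")

checkpoint = "2024-01-01T00:00:00"
for row in poll_changes(conn, checkpoint):
    print("change event:", row)   # downstream consumers react to each change
    checkpoint = row[2]           # advance the high-water mark
```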
## **2. Data Ingestion**
Bringing data into your ecosystem from multiple sources.
- **Data Ingestion** – Importing data from diverse sources into storage or processing systems.
- **ETL (Extract, Transform, Load)** – Data is extracted, transformed, and then loaded into a target system (see the sketch after this list).
- **ELT (Extract, Load, Transform)** – Data is extracted, loaded into storage, then transformed later.
- **Stream Processing** – Real-time ingestion and processing of data streams.
- **Batch Processing** – Ingesting and processing data in scheduled chunks.
- **Lambda Architecture** – Combines batch and real-time data processing for robustness.
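A toy ETL run, sketched under assumptions (the CSV source and `sales` table are made up): extract rows, transform them in Python, then load into the target. In ELT, the same transform would instead run inside the warehouse as SQL after loading the raw rows.

```python
import csv
import io
import sqlite3

# Toy ETL: extract rows from a CSV source, transform (normalize and
# filter), then load into a target table. Source data is hypothetical.

raw_csv = io.StringIO("name,amount\nalice, 10\nbob,-5\ncarol, 7\n")

# Extract
rows = list(csv.DictReader(raw_csv))

# Transform: normalize names, cast types, drop invalid records
clean = [
    (r["name"].strip().title(), int(r["amount"]))
    for r in rows
    if int(r["amount"]) > 0
]

# Load
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", clean)
print(conn.execute("SELECT * FROM sales").fetchall())  # [('Alice', 10), ('Carol', 7)]
```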
## **3. Storage & Architecture**
Where and how data is stored and architected for processing and querying.
- **Data Lake** – Stores raw, unstructured, and structured data at scale.
- **Data Warehouse** – Optimized for structured data and analytical queries.
- **Data Lakehouse** – A hybrid system combining the strengths of lakes and warehouses.
- **Data Mart** – A focused, domain-specific subset of a data warehouse.
- **Delta Lake** – An open-source storage layer that brings ACID transactions to data lakes.
- **Sharding** – Distributing large datasets across multiple machines (see the hash-based sketch after this list).
- **Partitioning** – Splitting datasets into segments for better manageability and performance.
- **Indexing** – Creating data structures to speed up query performance.
- **Caching** – Temporarily storing frequently accessed data to speed up performance.
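Sharding in one line is `hash(key) % n_shards`: a stable function routes each record to a shard. A minimal sketch, with a hypothetical shard count and record keys:

```python
import hashlib

# Hash-based sharding: route each record to one of N shards by hashing
# its key. The same idea distributes rows across database nodes or
# files across partition directories.

N_SHARDS = 4

def shard_for(key: str) -> int:
    """Stable assignment: the same key always lands on the same shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % N_SHARDS

for key in ["user:1", "user:2", "user:3", "user:42"]:
    print(key, "-> shard", shard_for(key))
```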
## **4. Data Processing & Transformation**
Once data is ingested and stored, it’s shaped for analysis.
- **Data Pipeline** – A sequence of automated processes that move and transform data.
- **Data Orchestration** – Coordinates pipeline workflows and schedules (e.g., with Airflow).
- **Data Modeling** – Designing logical data structures, like schemas and relationships.
- **Schema Evolution** – Updating schemas over time without breaking existing processes.
- **Idempotency** – Guarantees that reprocessing the same data produces the same result, so retries and replays don't create duplicates or errors (see the upsert sketch after this list).
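A minimal sketch of an idempotent load, assuming a hypothetical `users` table with a primary key: the upsert makes replaying the same batch a no-op instead of a source of duplicates.

```python
import sqlite3

# Idempotent load via upsert: running the same batch twice leaves the
# table in the same state instead of duplicating rows.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

batch = [(1, "a@example.com"), (2, "b@example.com")]

def load(rows):
    conn.executemany(
        "INSERT INTO users VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET email = excluded.email",
        rows,
    )

load(batch)
load(batch)  # safe to replay: reprocessing creates no duplicates
print(conn.execute("SELECT COUNT(*) FROM users").fetchone())  # (2,)
```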
## **5. Data Access & Consumption**
Data becomes usable for analytics, reporting, ML, or apps.
- **OLAP (Online Analytical Processing)** – Enables complex analytical queries across large datasets.
- **Metadata** – Information about data (e.g., schema, source, owner) that aids discoverability.
- **Data Catalog** – An organized, searchable inventory of available datasets and their metadata (see the sketch after this list).
- **Data Mart** – On the consumption side, a mart serves curated, domain-specific datasets to a particular business unit (see also Section 3).
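A toy catalog entry, with illustrative field names; real catalogs (e.g., DataHub, Amundsen) track much richer metadata, but the core idea is the same: store metadata so datasets can be found.

```python
from dataclasses import dataclass, field

# A toy data-catalog entry: the metadata that makes a dataset
# discoverable. Field names are illustrative.

@dataclass
class CatalogEntry:
    name: str
    owner: str
    schema: dict[str, str]
    tags: list[str] = field(default_factory=list)

catalog: dict[str, CatalogEntry] = {}

entry = CatalogEntry(
    name="sales.daily_orders",
    owner="analytics-team",
    schema={"order_id": "INTEGER", "amount": "REAL", "order_date": "DATE"},
    tags=["finance", "daily"],
)
catalog[entry.name] = entry

# Discovery: find datasets by tag
print([e.name for e in catalog.values() if "finance" in e.tags])
```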
## **6. Governance, Monitoring & Quality**
Overseeing data integrity, compliance, and trust across the organization.
- **Data Governance** – Policies and controls for data privacy, security, and quality.
- **Data Lineage** – Traces how data flows and transforms through the system.
- **Data Quality** – Ensuring data is accurate, complete, consistent, and reliable (see the rule-based checks after this list).
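A sketch of rule-based quality checks on a hypothetical batch; frameworks such as Great Expectations generalize this pattern of asserting completeness, uniqueness, and validity before data is trusted.

```python
# Simple rule-based data-quality checks: completeness, uniqueness, and
# a range constraint. The rules and dataset are hypothetical.

rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": -3.0},   # fails the range check
    {"id": 2, "amount": 5.0},    # duplicate id fails uniqueness
]

def check_quality(rows):
    failures = []
    if any(r["id"] is None for r in rows):
        failures.append("completeness: null id")
    ids = [r["id"] for r in rows]
    if len(ids) != len(set(ids)):
        failures.append("uniqueness: duplicate ids")
    if any(r["amount"] < 0 for r in rows):
        failures.append("validity: negative amount")
    return failures

print(check_quality(rows))  # report failures instead of loading bad data
```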
## **7. Modern Data Architecture Approaches**
Emerging or decentralized models to scale and democratize data.
- **Big Data** – Refers to datasets too large or complex for traditional tools.
- **Data Mesh** – Decentralized, domain-oriented approach to data ownership.
- **Lambda Architecture** – Introduced under ingestion; as an architectural pattern, it pairs a batch layer (complete, periodically recomputed views) with a speed layer (low-latency increments), merged at query time (see the sketch after this list).
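A toy serving-layer merge for a Lambda architecture, with made-up page counts: the batch layer holds a complete but stale view, the speed layer holds increments since the last batch run, and queries combine the two.

```python
from collections import Counter

# Toy Lambda serving layer: merge a precomputed batch view with
# real-time increments. All counts here are hypothetical.

batch_view = Counter({"page_a": 1000, "page_b": 400})   # recomputed nightly
speed_layer = Counter({"page_a": 12, "page_c": 3})      # real-time deltas

def query(page: str) -> int:
    """Combine the stale-but-complete batch count with fresh increments."""
    return batch_view[page] + speed_layer[page]

print(query("page_a"))  # 1012: batch total plus real-time increments
print(query("page_c"))  # 3: only seen since the last batch run
```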
## **8. Other Terms**
Cross-cutting concepts that don't map cleanly to a single pipeline stage.
- **Data Observability** – Monitoring data health and detecting issues (e.g., freshness, anomalies).
- **Reverse ETL** – Syncing data from warehouses back to business tools (like CRMs or ad platforms).
- **Data Contracts** – Agreements on what data producers deliver and how consumers use it.
- **Row-level Security** – Controlling access to specific data rows based on user roles (sketched below).
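A sketch of row-level security enforced in application code, with a hypothetical `invoices` table and role-to-region mapping; databases such as PostgreSQL enforce the same idea natively with RLS policies.

```python
import sqlite3

# Row-level security sketch: each role only sees rows for its region.
# Table, roles, and the region column are hypothetical.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id INTEGER, region TEXT, total REAL)")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [(1, "emea", 100.0), (2, "amer", 250.0), (3, "emea", 75.0)],
)

ROLE_REGION = {"emea_analyst": "emea", "amer_analyst": "amer"}

def rows_for(role: str) -> list[tuple]:
    """Return only the rows the role's region grants access to."""
    return conn.execute(
        "SELECT id, region, total FROM invoices WHERE region = ?",
        (ROLE_REGION[role],),
    ).fetchall()

print(rows_for("emea_analyst"))  # only EMEA rows are visible
```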