## **1. Data Generation / Sources**

Where data originates: applications, devices, logs, APIs, and so on.

- **Data Sources** – The origin points of data, such as databases, REST APIs, file systems, or streaming platforms.
- **CDC (Change Data Capture)** – Captures changes in source data systems in real time.
- **OLTP (Online Transaction Processing)** – Handles real-time transactions (e.g., banking, e-commerce).

## **2. Data Ingestion**

Bringing data into your ecosystem from multiple sources.

- **Data Ingestion** – Importing data from diverse sources into storage or processing systems.
- **ETL (Extract, Transform, Load)** – Data is extracted, transformed, and then loaded into a target system (see the pipeline sketch after section 7).
- **ELT (Extract, Load, Transform)** – Data is extracted, loaded into storage, then transformed later.
- **Stream Processing** – Real-time ingestion and processing of data streams.
- **Batch Processing** – Ingesting and processing data in scheduled chunks.
- **Lambda Architecture** – Combines batch and real-time data processing for robustness.

## **3. Storage & Architecture**

Where and how data is stored and architected for processing and querying.

- **Data Lake** – Stores raw, unstructured, and structured data at scale.
- **Data Warehouse** – Optimized for structured data and analytical queries.
- **Data Lakehouse** – A hybrid system combining the strengths of lakes and warehouses.
- **Data Mart** – A focused, domain-specific subset of a data warehouse.
- **Delta Lake** – Adds ACID transactions to data lakes for reliability.
- **Sharding** – Distributing large datasets across multiple machines.
- **Partitioning** – Splitting datasets into segments for better manageability and performance (see the partitioning sketch after section 7).
- **Indexing** – Creating data structures that speed up query performance.
- **Caching** – Temporarily storing frequently accessed data to speed up access.

## **4. Data Processing & Transformation**

Once data is ingested and stored, it’s shaped for analysis.

- **Data Pipeline** – A sequence of automated processes that move and transform data.
- **Data Orchestration** – Coordinates pipeline workflows and schedules (e.g., with Airflow; see the DAG sketch after section 7).
- **Data Modeling** – Designing logical data structures, such as schemas and relationships.
- **Schema Evolution** – Updating schemas over time without breaking existing processes.
- **Idempotency** – Ensures that reprocessing the same data doesn’t lead to duplicates or errors (illustrated in the pipeline sketch after section 7).

## **5. Data Access & Consumption**

Data becomes usable for analytics, reporting, ML, or apps.

- **OLAP (Online Analytical Processing)** – Enables complex analytical queries across large datasets.
- **Metadata** – Information about data (e.g., schema, source, owner) that aids discoverability.
- **Data Catalog** – An organized inventory of available datasets and their metadata.
- **Data Mart** – Listed again here because marts also serve as the consumption layer for specific business units.

## **6. Governance, Monitoring & Quality**

Overseeing data integrity, compliance, and trust across the organization.

- **Data Governance** – Policies and controls for data privacy, security, and quality.
- **Data Lineage** – Traces how data flows and transforms through the system.
- **Data Quality** – Ensuring data is accurate, complete, consistent, and reliable.

## **7. Modern Data Architecture Approaches**

Emerging or decentralized models for scaling and democratizing data.

- **Big Data** – Datasets too large or complex for traditional tools.
- **Data Mesh** – A decentralized, domain-oriented approach to data ownership.
- **Lambda Architecture** – Covered under Data Ingestion; it also belongs here as an architectural design pattern.
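To make a few of these terms concrete, here is a minimal sketch of a batch ETL pipeline in Python using only the standard library. It extracts rows from a CSV file, applies a simple transformation, and loads them into SQLite with an upsert keyed on a primary key, so re-running the same batch is idempotent. The file name, table, and schema are illustrative assumptions, not part of the glossary.

```python
import csv
import sqlite3

def extract(path: str) -> list[dict]:
    """Extract: read raw order rows from a CSV source file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows: list[dict]) -> list[tuple]:
    """Transform: normalize types and compute a derived total column."""
    out = []
    for r in rows:
        qty, price = int(r["quantity"]), float(r["unit_price"])
        out.append((r["order_id"], r["customer"], qty * price))
    return out

def load(rows: list[tuple], db_path: str = "warehouse.db") -> None:
    """Load: upsert keyed on order_id, so reprocessing the same batch
    updates existing rows instead of duplicating them (idempotency)."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id TEXT PRIMARY KEY, customer TEXT, total REAL)"
    )
    con.executemany(
        "INSERT INTO orders VALUES (?, ?, ?) "
        "ON CONFLICT(order_id) DO UPDATE SET "
        "customer = excluded.customer, total = excluded.total",
        rows,
    )
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("orders.csv")))  # one scheduled batch run
```

Running this twice on the same input leaves the `orders` table unchanged the second time, which is exactly the property the Idempotency entry above describes.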
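Orchestration tools express pipelines like the one above as a DAG of tasks. Below is a sketch of what that might look like in Apache Airflow, assuming Airflow 2.4 or newer (for the `schedule` parameter); the DAG name, schedule, and placeholder callables are assumptions for illustration.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(): ...    # pull raw data from the source (placeholder)
def transform(): ...  # clean and reshape it (placeholder)
def load(): ...       # write it to the warehouse (placeholder)

# The DAG declares the tasks and their ordering; the Airflow scheduler
# then triggers one run per day and retries or alerts on failure.
with DAG(
    dag_id="daily_orders_etl",      # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    t1 >> t2 >> t3  # extract runs first, then transform, then load
```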
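Partitioning is often done by date so that queries can skip irrelevant data. This standard-library sketch writes records into Hive-style date-partitioned directories; the directory layout and field names are illustrative assumptions.

```python
import csv
from pathlib import Path

def write_partitioned(records: list[dict], root: str = "lake/events") -> None:
    """Group records by event date and write each group into its own
    Hive-style partition directory (e.g., lake/events/date=2024-05-01/)."""
    by_date: dict[str, list[dict]] = {}
    for rec in records:
        by_date.setdefault(rec["event_date"], []).append(rec)

    for date, rows in by_date.items():
        part_dir = Path(root) / f"date={date}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part-0000.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)

# A query for a single day now touches one directory instead of the
# whole dataset: the manageability and performance win noted above.
write_partitioned([
    {"event_date": "2024-05-01", "user": "a", "action": "click"},
    {"event_date": "2024-05-02", "user": "b", "action": "view"},
])
```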
## **8. Other Terms**

Additional concepts that cut across the stages above.

- **Data Observability** – Monitoring data health and detecting issues such as stale data or anomalies (see the freshness-check sketch below).
- **Reverse ETL** – Syncing data from warehouses back into business tools (such as CRMs or ad platforms).
- **Data Contracts** – Agreements on what data producers deliver and how consumers may use it (see the validation sketch below).
- **Row-level Security** – Controlling access to specific data rows based on user roles (see the filtering sketch below).
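A basic observability check asserts freshness: the newest record should not be older than some threshold. A minimal sketch, assuming a SQLite table with a `loaded_at` column stored as an ISO-8601 UTC timestamp with offset (e.g., `2024-05-01T12:00:00+00:00`); the table name and threshold are assumptions.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def check_freshness(db_path: str, table: str, max_lag: timedelta) -> bool:
    """Return True if the newest row in `table` is recent enough.
    The table name is assumed trusted (this is a sketch, not hardened SQL)."""
    con = sqlite3.connect(db_path)
    (latest,) = con.execute(f"SELECT MAX(loaded_at) FROM {table}").fetchone()
    con.close()
    if latest is None:
        return False  # an empty table counts as stale
    lag = datetime.now(timezone.utc) - datetime.fromisoformat(latest)
    return lag <= max_lag

# Alert (here, just print) when the orders table is more than 2 hours stale.
if not check_freshness("warehouse.db", "orders", timedelta(hours=2)):
    print("ALERT: orders table is stale")
```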
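A data contract can be enforced mechanically by validating every producer record against the agreed schema before it is accepted downstream. The contract below is a hypothetical example.

```python
# A hypothetical contract: field name -> (expected type, required?).
ORDERS_CONTRACT = {
    "order_id": (str, True),
    "customer": (str, True),
    "total": (float, True),
    "coupon": (str, False),
}

def validate(record: dict, contract: dict) -> list[str]:
    """Return the contract violations for one record (empty list = valid)."""
    errors = []
    for field, (ftype, required) in contract.items():
        if field not in record:
            if required:
                errors.append(f"missing required field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"{field}: expected {ftype.__name__}")
    for field in record:
        if field not in contract:
            # Producers cannot add or rename fields without renegotiating.
            errors.append(f"unexpected field: {field}")
    return errors

print(validate({"order_id": "o1", "customer": "a", "total": "12"}, ORDERS_CONTRACT))
# -> ['total: expected float']
```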
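Row-level security is usually enforced inside the database (for example, PostgreSQL row-level security policies), but the idea can be shown with a simple application-level stand-in: every query result is filtered down to the rows a user's role permits. The role-to-region mapping is an assumption for illustration.

```python
# Hypothetical mapping of user roles to the regions they may see.
ROLE_REGIONS = {
    "emea_analyst": {"EU", "UK"},
    "global_admin": {"EU", "UK", "US", "APAC"},
}

def visible_rows(rows: list[dict], role: str) -> list[dict]:
    """Filter rows so a user only sees the regions their role allows.
    Unknown roles get an empty set, i.e., deny by default."""
    allowed = ROLE_REGIONS.get(role, set())
    return [r for r in rows if r["region"] in allowed]

sales = [
    {"region": "EU", "amount": 100},
    {"region": "US", "amount": 250},
]
print(visible_rows(sales, "emea_analyst"))  # only the EU row is returned
```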