Tuesday, February 4, 2025

Data Engineering and Best Practices

Data and types

  • Data at rest (e.g. batch data pipelines / data stored in warehouses or object stores)
  • Data in motion (e.g. streaming pipelines / real-time use cases)

Hadoop ecosystem

CI/CD

https://towardsdev.com/ci-cd-for-modern-data-engineering-e2e7d2d0a694
  • RDD vs. DataFrame vs. Dataset (link)
  • Creating a job in Spark (link)
  • Oozie workflow with a Spark job
  • RDD (Resilient Distributed Dataset)
  • SparkContext internal working
  • ETL pipeline with Spark (link)


Areas to focus for Data Engineer


Prioritise understanding these core concepts first. These principles are timeless and transferable: new frameworks will emerge and some will fade, but these fundamentals will remain crucial:
🔹SQL: This is the bedrock. Master it. Understand joins, aggregations, window functions, and query optimisation.
🔹NoSQL Databases: Learn about different NoSQL models and when to use them. Understand their trade-offs.
🔹Database Internals: Grasp the difference between row/columnar databases, indexing, and transactions.
🔹Distributed Systems: Understand distributed computing, partitioning, consistency, and fault tolerance.
🔹Data Modeling: Learn different modeling techniques and how to design efficient schemas.
🔹ETL/ELT Concepts: Understand data processing, transformation, and data quality.
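As a concrete taste of the SQL fundamentals above (aggregations and window functions), here is a minimal sketch using Python's built-in sqlite3, so it runs anywhere. The table and data are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('alice', 10), ('alice', 30), ('bob', 20), ('bob', 5);
""")

# Per-customer running total: an aggregation done as a window function,
# so every input row is preserved in the output (unlike GROUP BY).
rows = conn.execute("""
    SELECT customer,
           amount,
           SUM(amount) OVER (PARTITION BY customer ORDER BY amount) AS running_total
    FROM orders
""").fetchall()

for customer, amount, running_total in rows:
    print(customer, amount, running_total)
```

The same pattern (PARTITION BY + ORDER BY inside OVER) carries over to ranking, lag/lead, and moving averages.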

Once you have a solid grasp of these fundamentals, learning specific tools becomes much easier. You’ll understand why they work the way they do.

Regarding the modern data stack and big data tools, including cloud data warehouses and query engines:

Be aware of popular tools like dbt for transformations, Airflow/Prefect/Dagster for orchestration, Spark/Flink for processing, Kafka/Pulsar for streaming, and the evolving data lakehouse landscape with Iceberg/Delta Lake/Hudi. It's also important to understand the landscape of cloud data warehouses and high-performance query engines:
🔹Cloud Data Warehouses (Snowflake, BigQuery, AWS Redshift): These offer scalable and managed solutions for analytical workloads. Understand their strengths, weaknesses, and use cases.
🔹High-Performance Query Engines (ClickHouse, StarRocks): These are designed for real-time analytics and often used for specific use cases like dashboards and reporting.

Saturday, February 1, 2025

Database concepts in detail

Types of Databases (link)

NoSQL databases differ significantly from one another. There are four main kinds: document databases, key-value stores, column-oriented databases, and graph databases.

   Note: Vector DB, Event store


Types of Databases

  • Hierarchical Databases
  • Relational Databases
  • NoSQL Databases
       Document -> MongoDB, DocumentDB
       Key-value -> Redis, DynamoDB
       Columnar -> Cassandra, Bigtable, Druid
       Graph -> Azure Cosmos DB
       Time series -> InfluxDB, Prometheus
  • Network Databases
  • Object-oriented Databases
  • Cloud Databases
  • Centralized Databases
  • Operational Databases
  • NewSQL Databases -> CockroachDB
  • File storage
  • Block storage
Object storage vs. block storage vs. file storage:
https://aws.amazon.com/compare/the-difference-between-block-file-object-storage/


Techniques for Optimizing

  • Avoiding Over-Indexing
  • Efficient Query Design
  • Use of Stored Procedures
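Avoiding over-indexing starts with checking whether the planner actually uses an index for your real queries. A small sketch with sqlite3 (table, column, and index names are invented) showing how a query plan changes once an index exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, ts INTEGER, payload TEXT)")

# Without an index, the planner can only do a full table scan.
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 42"
).fetchall()

conn.execute("CREATE INDEX idx_events_user ON events(user_id)")

# With the index in place, the same query becomes an index search.
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 42"
).fetchall()

# The last column of each plan row is a human-readable description
# (exact wording varies slightly between SQLite versions).
print(plan_before[-1][-1])
print(plan_after[-1][-1])
```

The inverse check matters too: indexes no query plan ever touches still cost write amplification and disk space, and are candidates for dropping.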


Key Metrics to Track

To maintain the health of your database, it’s important to track key metrics that provide insights into its performance and stability:

  • QPS (Queries Per Second): Measures the number of queries processed per second, helping you understand the load on your database.
  • Latency: Tracks the time taken to execute queries, indicating the responsiveness of your system.
  • CPU and Memory Usage: Monitors the resource consumption of your database nodes, ensuring they are not overburdened.
  • Disk I/O: Measures the read and write operations on your storage devices, highlighting potential bottlenecks.
  • Replication Lag: Indicates the delay in data replication across nodes, which is crucial for maintaining consistency and availability.
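Two of the metrics above, QPS and latency, can be derived directly from a log of query timings. A toy sketch with invented numbers (real systems read these from the database's metrics endpoint and use histogram sketches rather than raw samples):

```python
from statistics import quantiles

# (start_timestamp_seconds, duration_ms) for each query in a 2-second window
query_log = [
    (0.1, 12.0), (0.4, 8.5), (0.9, 30.2), (1.2, 9.1), (1.8, 15.7),
]

window_seconds = 2.0
qps = len(query_log) / window_seconds  # queries per second over the window

durations = sorted(d for _, d in query_log)
# p95 latency via 20-quantile cut points; a rough estimate on a tiny
# sample, shown only to make the metric concrete.
p95 = quantiles(durations, n=20, method="inclusive")[18]

print(f"QPS: {qps:.1f}, p95 latency: {p95:.1f} ms")
```

Tracking these as time series (rather than point-in-time values) is what makes regressions after a deploy or schema change visible.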

Regular Maintenance Practices


Index Rebuilding

Indexes play a vital role in query performance, but they can become fragmented over time, leading to inefficiencies. Regularly rebuilding indexes helps maintain their effectiveness:

  • Reorganize Index: This operation defragments the index pages, improving read and write performance without locking the table.
  • Rebuild Index: This more intensive operation creates a new index and drops the old one, fully optimizing the index structure. It’s useful for heavily fragmented indexes but may require downtime.
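The reorganize/rebuild distinction above comes from engines like SQL Server; as a runnable stand-in, SQLite exposes a single REINDEX statement that drops and rebuilds an index's B-tree from the table data, roughly the "rebuild" operation described. A sketch via sqlite3 (table and index names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    CREATE INDEX idx_users_email ON users(email);
""")

# Rebuild the named index from scratch, discarding any fragmentation.
conn.execute("REINDEX idx_users_email")

# Confirm the index still exists after the rebuild.
idx = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='index' AND name='idx_users_email'"
).fetchone()
print(idx)
```

In server databases the equivalent statements (e.g. ALTER INDEX ... REORGANIZE/REBUILD) differ per engine, so check the syntax and locking behaviour for the one you run.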

Database Backups

Regular backups are essential for data protection and disaster recovery. TiDB provides several tools and strategies for effective backup management:

  • BR (Backup & Restore): A command-line tool designed for large-scale data backup and restoration. It supports both full and incremental backups, allowing you to efficiently manage your backup strategy.
  • Dumpling: A lightweight tool for exporting data from TiDB into SQL or CSV files. It’s useful for smaller datasets or when you need to migrate data between environments.


