Tuesday, February 4, 2025

Data Engineering and Best Practices

Data and types

  • Data at rest (e.g. batch data pipelines / data stored in warehouses or object stores)
  • Data in motion (e.g. streaming pipelines / real-time use cases).
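
The two categories above mostly differ in when the computation runs. The following is a minimal sketch, assuming a list of numeric sensor readings (the function names and sample values are hypothetical): the batch version aggregates a complete dataset in one pass, while the streaming version emits an updated result after every event and keeps only constant state.

```python
from typing import Iterable, Iterator, Sequence


def batch_average(readings: Sequence[float]) -> float:
    """Data at rest: the whole dataset is available, so aggregate it in one go."""
    return sum(readings) / len(readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Data in motion: yield a fresh average after each event, keeping O(1) state."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


stored = [10.0, 20.0, 30.0]                    # hypothetical sample data
print(batch_average(stored))                   # one answer, after all data has landed
print(list(streaming_average(iter(stored))))   # one answer per incoming event
```

A real pipeline would read from a warehouse table or a message broker rather than an in-memory list; the shape of the computation is the point here.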

Hadoop ecosystem







CI/CD

https://towardsdev.com/ci-cd-for-modern-data-engineering-e2e7d2d0a694
  • RDD vs. DataFrame vs. Dataset
  • Creating a job in Spark
  • Oozie workflow with a Spark job
  • RDD (resilient distributed dataset)
  • Spark context internals
  • ETL pipeline with Spark


Areas of Focus for a Data Engineer


Prioritise understanding these core concepts first. They are timeless and transferable: new frameworks will emerge and some will fade, but these fundamentals will remain crucial:
🔹SQL: This is the bedrock. Master it. Understand joins, aggregations, window functions, and query optimisation.
🔹NoSQL Databases: Learn about different NoSQL models and when to use them. Understand their trade-offs.
🔹Database Internals: Grasp the difference between row/columnar databases, indexing, and transactions.
🔹Distributed Systems: Understand distributed computing, partitioning, consistency, and fault tolerance.
🔹Data Modeling: Learn different modeling techniques and how to design efficient schemas.
🔹ETL/ELT Concepts: Understand data processing, transformation, and data quality.
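
To make the SQL bullet concrete, here is a small sketch of a window function: a per-customer running total. It uses Python's built-in sqlite3 module and assumes the bundled SQLite is version 3.25 or newer (the first release with window functions). The `orders` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

RUNNING_TOTAL_SQL = """
    SELECT customer,
           order_day,
           amount,
           SUM(amount) OVER (PARTITION BY customer ORDER BY order_day) AS running_total
    FROM orders
    ORDER BY customer, order_day
"""


def running_totals(orders):
    """Load (customer, order_day, amount) rows and return each row with a running total."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (customer TEXT, order_day INTEGER, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", list(orders))
        return conn.execute(RUNNING_TOTAL_SQL).fetchall()
    finally:
        conn.close()


sample = [("alice", 1, 10.0), ("alice", 2, 5.0), ("bob", 1, 7.0), ("alice", 3, 2.5)]
for row in running_totals(sample):
    print(row)
```

Unlike GROUP BY, the window function keeps every input row and adds the aggregate alongside it, which is why it suits running totals, rankings, and moving averages.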

Once you have a solid grasp of these fundamentals, learning specific tools becomes much easier. You’ll understand why they work the way they do.

Regarding the modern data stack and big data tools, including cloud data warehouses and query engines:

Be aware of popular tools like dbt for transformations, Airflow/Prefect/Dagster for orchestration, Spark/Flink for processing, Kafka/Pulsar for streaming, and the evolving data lakehouse landscape with Iceberg/Delta Lake/Hudi. It's also important to understand the landscape of cloud data warehouses and high-performance query engines:
🔹Cloud Data Warehouses (Snowflake, BigQuery, AWS Redshift): These offer scalable and managed solutions for analytical workloads. Understand their strengths, weaknesses, and use cases.
🔹High-Performance Query Engines (ClickHouse, StarRocks): These are designed for real-time analytics and often used for specific use cases like dashboards and reporting.

Saturday, February 1, 2025

Database Concepts in Detail

Types of Databases




NoSQL databases differ from one another. There are four main kinds: document databases, key-value stores, column-oriented databases, and graph databases.

   Note: vector databases and event stores are further categories worth knowing.


Types of Databases

  • Hierarchical Databases
  • Relational Databases
  • NoSQL Databases
       Document -> MongoDB, Amazon DocumentDB
       Key-value -> Redis, DynamoDB
       Columnar -> Cassandra, Bigtable, Druid
       Graph -> Azure Cosmos DB (via its Gremlin API)
       Time series -> InfluxDB, Prometheus
  • Network Databases
  • Object-oriented Databases
  • Cloud Databases
  • Centralized Databases
  • Operational Databases
  • NewSQL Databases -> CockroachDB
  • File storage
  • Block storage
Object storage vs. block storage vs. file storage:
https://aws.amazon.com/compare/the-difference-between-block-file-object-storage/


Techniques for Optimizing

  • Avoiding Over-Indexing
  • Efficient Query Design
  • Use of Stored Procedures


Key Metrics to Track

To maintain the health of your database, it’s important to track key metrics that provide insights into its performance and stability:

  • QPS (Queries Per Second): Measures the number of queries processed per second, helping you understand the load on your database.
  • Latency: Tracks the time taken to execute queries, indicating the responsiveness of your system.
  • CPU and Memory Usage: Monitors the resource consumption of your database nodes, ensuring they are not overburdened.
  • Disk I/O: Measures the read and write operations on your storage devices, highlighting potential bottlenecks.
  • Replication Lag: Indicates the delay in data replication across nodes, which is crucial for maintaining consistency and availability.
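
As a rough illustration of how some of these metrics are derived, the sketch below computes QPS, a nearest-rank latency percentile, and replication lag from raw samples. The function names, the sample latencies, and the timestamp-based definition of lag are assumptions for the example; a real deployment would read these values from the database's own monitoring endpoints.

```python
import math
from typing import Sequence


def queries_per_second(query_count: int, window_seconds: float) -> float:
    """QPS: queries completed in a window divided by the window length."""
    return query_count / window_seconds


def latency_percentile(latencies_ms: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def replication_lag_seconds(primary_commit_ts: float, replica_applied_ts: float) -> float:
    """Lag, assumed here to be the gap between the newest primary commit and the newest replica apply."""
    return max(0.0, primary_commit_ts - replica_applied_ts)


samples_ms = [12, 15, 11, 250, 14, 13, 16, 12, 18, 17]  # hypothetical 2-second window
print(queries_per_second(len(samples_ms), 2.0))
print(latency_percentile(samples_ms, 50), latency_percentile(samples_ms, 95))
```

Tracking a high percentile next to the median matters because a single slow query (250 ms in the sample) barely moves the median but dominates the tail.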

Regular Maintenance Practices


Index Rebuilding

Indexes play a vital role in query performance, but they can become fragmented over time, leading to inefficiencies. Regularly rebuilding indexes helps maintain their effectiveness:

  • Reorganize Index: This operation defragments the index pages, improving read and write performance without locking the table.
  • Rebuild Index: This more intensive operation creates a new index and drops the old one, fully optimizing the index structure. It’s useful for heavily fragmented indexes but may require downtime.

Database Backups

Regular backups are essential for data protection and disaster recovery. TiDB provides several tools and strategies for effective backup management:

  • BR (Backup & Restore): A command-line tool designed for large-scale data backup and restoration. It supports both full and incremental backups, allowing you to efficiently manage your backup strategy.
  • Dumpling: A lightweight tool for exporting data from TiDB into SQL or CSV files. It’s useful for smaller datasets or when you need to migrate data between environments.



Thursday, January 16, 2025

System Design Interview Preparation




How do you design a read-heavy system?

https://medium.com/@vinciabhinav7/how-to-design-a-read-heavy-system-some-strategies-and-best-practices-20e416a77cfd query optimisation

How do you design a write-heavy system?

How do you write a billion records to a database? [In general, not just for microservices]

How do you design a low-latency application?


How do you design a highly available application?

How do you design a highly available and fault-tolerant application?

How do you design a data streaming application?

How do you design a bank management system?

How do you warm up services in Java?

What are the different types of databases?

How do you select which database to use?




How do you handle a sudden bulk load on capacity?

   Rate limiting (Spring Boot WebFlux)
   Backpressure (WebFlux)
   Batching up requests


How do you secure passwords or secrets in a production environment?
No very good example found. [Generally, use a vault and reference the secrets directly from infrastructure code using a plugin.]


Serverless Computing


Big Data
https://www.altexsoft.com/blog/big-data-analytics-explained/


How do you estimate the load capacity of a website?
  • Put your scenario in place
  • Add monitoring
  • Add traffic
  • Evaluate results
  • Remediate based on results
  • Rinse, repeat until reasonably happy
 Capacity Estimation: 
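
Before running load tests, a back-of-the-envelope calculation gives a first target. The sketch below uses the 1 million requests per day figure from the API scenario in these notes; the peak factor of 10, the 50 requests/second per instance, the 30% headroom, and the 200 ms latency are illustrative assumptions that should be replaced with measured values.

```python
import math

SECONDS_PER_DAY = 86_400


def average_qps(requests_per_day: float) -> float:
    """Average load if traffic were spread evenly over the day."""
    return requests_per_day / SECONDS_PER_DAY


def peak_qps(avg_qps: float, peak_factor: float) -> float:
    """Peak load, modelled as a multiple of the average (the factor is an assumption)."""
    return avg_qps * peak_factor


def required_instances(peak: float, per_instance_qps: float, headroom: float = 0.3) -> int:
    """Instances needed at peak while keeping `headroom` of each instance's capacity unused."""
    usable_qps = per_instance_qps * (1 - headroom)
    return math.ceil(peak / usable_qps)


def concurrent_requests(qps: float, latency_seconds: float) -> float:
    """Little's law: requests in flight = arrival rate * time each request spends in the system."""
    return qps * latency_seconds


avg = average_qps(1_000_000)          # about 11.6 requests/second
peak = peak_qps(avg, peak_factor=10)  # about 116 requests/second
print(round(avg, 2), round(peak, 2))
print(required_instances(peak, per_instance_qps=50))
print(round(concurrent_requests(peak, latency_seconds=0.2), 1))
```

The estimate only sets the starting point for the monitor, add traffic, evaluate loop listed above; measured per-instance throughput replaces the assumed numbers as soon as it is available.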



Optimize Performance of a High-Volume Financial Transactions API
You are working on a Spring Boot REST API that processes over 1 million financial transactions daily. The API is experiencing high latency and excessive CPU/memory usage. How do you optimize it for better performance?

Optimizing the performance of a high-volume financial transactions API involves several strategies to reduce latency and manage CPU and memory usage effectively. Here are key approaches to consider:

  1. Implement Caching: Utilize caching mechanisms to store frequently accessed data, reducing the need for repetitive database queries. Tools like Redis or Memcached can be employed to cache responses for high-traffic endpoints. Ensure that cache expiration policies are appropriately set to maintain data consistency.

  2. Optimize Database Interactions:

    • Efficient Queries: Review and optimize database queries to ensure they are performant. Avoid fetching unnecessary data and ensure that queries are indexed appropriately.
    • Connection Pooling: Use a fast connection pool, such as HikariCP, and configure it optimally to manage database connections efficiently.
  3. Employ Asynchronous Processing: For operations that don't require immediate responses, implement asynchronous processing. This approach allows the API to handle other requests while waiting for long-running tasks to complete, thereby improving overall responsiveness.

  4. Utilize Reactive Programming: Adopt a reactive, non-blocking programming model to handle concurrent requests more efficiently. This is particularly beneficial when the API acts as a pass-through to external services, as it allows threads to be reused while waiting for external responses.

  5. Implement Pagination and Filtering: For endpoints that return large datasets, incorporate pagination and filtering to limit the amount of data processed and transmitted in each request. This reduces server load and response times.

  6. Enable Response Compression: Compress API responses to reduce payload sizes, leading to faster transmission times and reduced bandwidth usage. GZIP is a commonly used compression method that can be enabled in Spring Boot applications.

  7. Monitor and Manage Thread Pools: Properly configure thread pools to match the application's workload and the server's capabilities. This ensures that the API can handle concurrent requests without overwhelming system resources.

  8. Implement Rate Limiting: Introduce rate limiting to prevent abuse and ensure fair usage of the API. This helps protect the system from being overwhelmed by excessive requests from a single client.

  9. Profile and Monitor Performance: Continuously monitor the API's performance to identify bottlenecks. Use profiling tools to gain insights into CPU and memory usage, and adjust configurations as needed to optimize resource utilization.

By systematically applying these strategies, you can enhance the performance of your Spring Boot REST API, ensuring it efficiently handles over a million financial transactions daily.
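
Strategy 1 (caching with an expiration policy) can be sketched in a few lines. This is an in-process Python stand-in, not Spring Boot or Redis code: the `TTLCache` class, its `get_or_load` method, and the injectable `clock` parameter are hypothetical names chosen for illustration. The same read-through idea applies when the store is Redis or Memcached.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Read-through cache: serve a stored value until it expires, then reload it."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[Hashable], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the expensive loader is skipped
        value = loader(key)  # cache miss or expired entry: hit the backing store
        self._entries[key] = (now + self._ttl, value)
        return value


def slow_account_lookup(account_id):
    """Hypothetical stand-in for a database query."""
    return {"id": account_id, "status": "active"}


cache = TTLCache(ttl_seconds=30)
print(cache.get_or_load("acct-1", slow_account_lookup))  # loads from the "database"
print(cache.get_or_load("acct-1", slow_account_lookup))  # served from the cache
```

The TTL is the consistency knob: a short TTL keeps data fresher at the cost of more database reads, which matters for financial data where stale balances are unacceptable. This sketch is not thread-safe and never evicts unexpired entries, so it is a teaching aid rather than a production cache.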


BASE vs. ACID:
  • Traditionally, NoSQL databases often followed the BASE (Basically Available, Soft state, Eventually consistent) model, which prioritizes availability and performance.   
  • Relational databases, on the other hand, adhere to ACID properties, ensuring data integrity.
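
The "A" in ACID (atomicity) is the easiest property to demonstrate. In this minimal sketch, using Python's built-in sqlite3 module, a transfer either applies both updates or neither. The `accounts` table and the helper names are hypothetical, and balances are stored as integers to avoid rounding issues.

```python
import sqlite3


def open_bank(initial_balances):
    """Create an in-memory database with one row per account."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", list(initial_balances.items()))
    conn.commit()
    return conn


def transfer(conn, source, target, amount):
    """Move `amount` between accounts atomically; any error undoes both updates."""
    with conn:  # commits on success, rolls back if an exception escapes the block
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, source))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, target))
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (source,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


bank = open_bank({"alice": 100, "bob": 0})
transfer(bank, "alice", "bob", 30)
print(balances(bank))
```

An eventually consistent (BASE) store would typically accept each write independently and reconcile later, which is why multi-record invariants like "no negative balance" are harder to enforce there.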

Explain the difference between throttling, rate limiting, and backpressure management

  Key Differences Summarized:
  • Rate limiting is a precise method for restricting request counts within a time window.  
  • Throttling is a more general term for controlling resource consumption, which may or may not include rate limiting.  
  • Backpressure management is specifically designed to handle flow control between components with different processing speeds.  

In simpler terms:

  • Rate limiting says, "You can only do X amount of things in Y time."  
  • Throttling says, "We need to slow things down to keep the system healthy."
  • Backpressure says, "Slow down, I am getting overloaded."
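
The distinction can be sketched in code. Below, a token bucket implements rate limiting ("X things in Y time"), and a bounded queue implements a simple form of backpressure (the consumer's full buffer tells the producer to stop). The class and function names are hypothetical, and the token bucket is one common rate-limiting algorithm among several (fixed window and sliding window are others).

```python
import queue
import time
from typing import Any, Callable


class TokenBucket:
    """Rate limiter: allows bursts up to `capacity`, refilled at `rate_per_second`."""

    def __init__(self, rate_per_second: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic):
        self._rate = rate_per_second
        self._capacity = capacity
        self._tokens = float(capacity)
        self._clock = clock
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never beyond the bucket capacity.
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False  # caller would typically answer with HTTP 429


def offer(buffer: "queue.Queue[Any]", item: Any) -> bool:
    """Backpressure: a full bounded buffer rejects new work instead of growing without limit."""
    try:
        buffer.put_nowait(item)
        return True
    except queue.Full:
        return False  # signal the producer to slow down or retry later


limiter = TokenBucket(rate_per_second=5, capacity=10)
work = queue.Queue(maxsize=100)
if limiter.allow() and offer(work, "request-1"):
    print("accepted")
```

The rate limiter decides based on a policy fixed in advance, while the bounded queue reacts to how fast the consumer is actually draining work. The token bucket here is not thread-safe; a shared limiter would need a lock or an external store.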



System Design  








****************************************************************


Additional topics:

  • Dynamo - Highly Available Key-value Store
  • Kafka - A Distributed Messaging System for Log Processing
  • Consistent Hashing - Original paper
  • Paxos - Protocol for distributed consensus
  • Concurrency Controls - Optimistic methods for concurrency controls 
  • Gossip protocol - For failure detection and more.
  • Chubby - Lock service for loosely-coupled distributed systems
  • ZooKeeper - Wait-free coordination for Internet-scale systems
  • MapReduce - Simplified Data Processing on Large Clusters
  • Hadoop - The Hadoop Distributed File System (HDFS)
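
Of the topics above, consistent hashing is compact enough to sketch directly. This is a simplified ring with virtual nodes, not the construction from the original paper: the class name, the use of SHA-256, and the default of 100 virtual nodes per server are assumptions made for the example.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


class ConsistentHashRing:
    """Maps keys to nodes so that adding or removing a node moves only a fraction of keys."""

    def __init__(self, nodes: Iterable[str] = (), replicas: int = 100):
        self._replicas = replicas  # virtual nodes per physical node, to even out load
        self._ring: List[Tuple[int, str]] = []  # sorted (position, node) pairs
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _position(key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add_node(self, node: str) -> None:
        for i in range(self._replicas):
            bisect.insort(self._ring, (self._position(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(pos, name) for pos, name in self._ring if name != node]

    def get_node(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First virtual node at or after the key's position, wrapping around the ring.
        index = bisect.bisect_left(self._ring, (self._position(key), ""))
        if index == len(self._ring):
            index = 0
        return self._ring[index][1]


ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.get_node("user-42"))
```

With plain `hash(key) % n`, changing the number of servers remaps almost every key. On the ring, removing a node only reassigns the keys that node owned, which is the property Dynamo-style stores rely on.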

 

Advanced questions:

Java & Backend Development
1. If Java didn’t have the synchronized keyword, how would you implement thread safety?
2. How would you store a billion records in memory while ensuring efficient search operations?
3. Explain Java’s ClassLoader in a way that a 10-year-old could understand.
4. What exactly happens inside the JVM when a NullPointerException is thrown?

System Design Challenges
5. Design a traffic management system for a city with self-driving cars.
6. If you had to reduce API response time by 50% in a large-scale system, where would you start?
7. How would you design a video streaming platform that adapts in real-time to network conditions?

Algorithm & Data Structures Curveballs
8. Can you sort an array faster than O(n log n)?
9. You have an infinite stream of numbers. How would you efficiently find the median at any point?
10. If you could only use one data structure for every problem, which one would it be and why?
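
One common answer to question 9 is the two-heap technique: keep the lower half of the numbers in a max-heap and the upper half in a min-heap, so the median is always at the heap tops. The sketch below is one possible implementation; the class name is hypothetical, and it assumes all values fit in memory (an unbounded stream would need an approximate method instead).

```python
import heapq
from typing import List


class RunningMedian:
    """Median of a stream in O(log n) per insert and O(1) per query."""

    def __init__(self) -> None:
        self._low: List[float] = []   # max-heap of the lower half (values stored negated)
        self._high: List[float] = []  # min-heap of the upper half

    def add(self, value: float) -> None:
        # Push into the lower half, then move its largest element up,
        # so every element in _low is <= every element in _high.
        heapq.heappush(self._low, -value)
        heapq.heappush(self._high, -heapq.heappop(self._low))
        # Rebalance so _low holds the same number of items as _high, or one more.
        if len(self._high) > len(self._low):
            heapq.heappush(self._low, -heapq.heappop(self._high))

    def median(self) -> float:
        if not self._low:
            raise ValueError("no values seen yet")
        if len(self._low) > len(self._high):
            return float(-self._low[0])
        return (-self._low[0] + self._high[0]) / 2


tracker = RunningMedian()
for number in [5, 2, 8, 1, 9]:
    tracker.add(number)
    print(number, tracker.median())
```

Sorting the whole stream on every query would cost O(n log n) each time; the heaps keep just enough order to read the middle element directly.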

Unique & Unexpected Questions
11. How would you explain recursion to someone who has never coded before?
12. If you could remove one feature from Java, what would it be and why?

13. Tell me something interesting about technology that isn’t on your resume.  


Broad Category of System Design

➥1. Load Balancer
Key Topics:
- Types of Load Balancers - Application Layer (L7) vs Network Layer (L4).
- Algorithms - Round Robin, Least Connections, IP Hashing.
- Health Checks - Monitoring server availability and performance.
- Sticky Sessions - Keeping user sessions tied to specific servers.
- Scaling Strategies - Horizontal vs Vertical scaling with load balancers.
- Global Load Balancers - Handling traffic across multiple regions.
- Reverse Proxy - Serving as a gateway and caching responses.
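
The three algorithms named above can be sketched as follows. These are simplified, single-threaded illustrations with hypothetical class and function names; real load balancers add health checks, weights, and thread safety.

```python
import itertools
import zlib
from typing import Dict, Sequence


class RoundRobinBalancer:
    """Hands out servers in a fixed rotation, ignoring their current load."""

    def __init__(self, servers: Sequence[str]):
        self._cycle = itertools.cycle(list(servers))

    def pick(self) -> str:
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Sends each request to the server with the fewest in-flight requests."""

    def __init__(self, servers: Sequence[str]):
        self._active: Dict[str, int] = {server: 0 for server in servers}

    def acquire(self) -> str:
        server = min(self._active, key=self._active.get)  # ties go to the earliest server
        self._active[server] += 1
        return server

    def release(self, server: str) -> None:
        self._active[server] -= 1  # call when the request finishes


def pick_by_ip_hash(servers: Sequence[str], client_ip: str) -> str:
    """IP hashing: the same client address always maps to the same server."""
    return servers[zlib.crc32(client_ip.encode("utf-8")) % len(servers)]


servers = ["app-1", "app-2", "app-3"]
rr = RoundRobinBalancer(servers)
print([rr.pick() for _ in range(4)])
print(pick_by_ip_hash(servers, "203.0.113.7"))
```

Round robin suits uniform, short requests; least connections adapts when request durations vary; IP hashing gives a form of session stickiness without storing state, at the cost of remapping clients when the server list changes.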

➥2. Application Server
Key Topics:
- Stateless vs Stateful Servers - When to use which.
- Caching Strategies - In-memory caching (Redis/Memcached) and local caching.
- Session Management - Cookies vs Tokens (JWT).
- Concurrency Handling - Managing multiple requests with threads or async models.
- Microservices Architecture - Service discovery and inter-service communication.
- Containerization - Docker, Kubernetes, and deployment strategies.
- Rate Limiting & Throttling - Preventing abuse and managing traffic bursts.

➥3. Database (SQL vs NoSQL)
Key Topics:
- SQL vs NoSQL - When to choose which database type.
- Sharding and Partitioning - Horizontal scaling techniques.
- Replication - Primary-Secondary, Multi-Master setups for reliability.
- Consistency Models - Strong vs Eventual Consistency (CAP theorem).
- Indexing Strategies - Improving query performance.
- Caching Layers - Redis, Memcached for faster reads.
- Backup and Recovery - Disaster recovery planning and failover systems.

➥4. Pub-Sub or Producer-Consumer
Key Topics:
- Messaging Patterns - Pub-Sub vs Queue-based systems.
- Message Brokers - Kafka, RabbitMQ, AWS SQS/SNS.
- Idempotency - Avoiding duplicate message processing.
- Durability and Ordering - Ensuring messages aren’t lost or misordered.
- Dead Letter Queues - Handling failed messages.
- Scaling Consumers - Parallel processing and worker pools.
- Eventual Consistency - Maintaining consistency with asynchronous systems.
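
Idempotency and dead letter queues often appear together in a consumer. The sketch below assumes every message carries a unique ID and that at-least-once delivery may redeliver it; the class name, the status strings, and the in-memory set of processed IDs are hypothetical simplifications (a real consumer would persist the IDs and use the broker's own dead letter mechanism).

```python
from typing import Any, Callable, List, Set, Tuple


class IdempotentConsumer:
    """Processes each message ID at most once and parks repeatedly failing messages."""

    def __init__(self, handler: Callable[[Any], None], max_attempts: int = 3):
        self._handler = handler
        self._max_attempts = max_attempts
        self._processed_ids: Set[str] = set()
        self.dead_letters: List[Tuple[str, Any]] = []

    def consume(self, message_id: str, payload: Any) -> str:
        if message_id in self._processed_ids:
            return "duplicate"  # redelivery: skip so side effects are not repeated
        for _ in range(self._max_attempts):
            try:
                self._handler(payload)
            except Exception:
                continue  # retry up to max_attempts times
            self._processed_ids.add(message_id)
            return "processed"
        # Every attempt failed: move the message aside so it does not block the queue.
        self.dead_letters.append((message_id, payload))
        return "dead-lettered"


ledger = []
consumer = IdempotentConsumer(handler=ledger.append)
print(consumer.consume("msg-1", {"credit": 50}))
print(consumer.consume("msg-1", {"credit": 50}))  # redelivered by the broker
print(ledger)
```

Recording the ID only after the handler succeeds means a crash between the two steps can still cause a repeat, so the handler's own writes should ideally be idempotent as well.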

➥5. Content Delivery Network (CDN)
Key Topics:
- How CDNs Work - Edge caching and reducing latency.
- Caching Policies - TTL (Time-to-Live) and Cache Invalidation.
- Geolocation Routing - Serving content from nearest data centers.
- Static vs Dynamic Content Delivery - Optimizing for both.
- SSL/TLS Termination - Secure communication.
- Load Distribution - Managing spikes in traffic.
- DDoS Protection - Preventing attacks and ensuring availability.

    

  Microservice Developer Roadmap

1. Microservices Architecture Basics: Monolithic vs. Microservices, characteristics (independence, scalability, resilience), and designing microservices boundaries (DDD - Domain-Driven Design).

2. Service Communication: Synchronous (REST, gRPC) vs. Asynchronous (Message Queues), API design and versioning, event-driven architecture, and event sourcing.

3. Data Management: Database per service, distributed data management (saga pattern, 2PC, CQRS), and handling data consistency across services.

4. Deployment Strategies: Containerization (Docker), orchestration (Kubernetes), and service discovery and registry (Eureka, Consul).

5. Frameworks and Tools: Spring Boot (Spring Cloud for microservices), Micronaut, Quarkus, or Dropwizard as alternatives.

6. Communication Protocols: RESTful APIs and gRPC, messaging systems (Kafka, RabbitMQ).

7. Databases: SQL (PostgreSQL, MySQL), NoSQL (MongoDB, Cassandra), and distributed caching (Redis, Memcached).

8. CI/CD Pipelines: Tools like Jenkins, GitHub Actions, GitLab CI, and deployment strategies like Blue-Green and Canary deployments.

9. Infrastructure as Code: Terraform, Ansible, or AWS CloudFormation.

10. Logging and Monitoring: Centralized logging (ELK Stack, Splunk) and monitoring tools (Prometheus, Grafana).

11. Resilience and Fault Tolerance: Circuit Breaker (Hystrix, Resilience4j), Bulkhead pattern, and retries.

12. Security: OAuth2, OpenID Connect, and API Gateways (Zuul, Spring Cloud Gateway, Kong).

13. Testing Microservices: Unit and integration testing, contract testing (Pact), and end-to-end testing.

14. Scalability Patterns: Horizontal and vertical scaling, load balancing (HAProxy, NGINX).

15. Distributed Tracing: Tools like Jaeger and Zipkin.

16. Anti-Patterns: Avoiding distributed monoliths and over-engineering microservices. 
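
The circuit breaker from item 11 is a small state machine, sketched below in simplified form. This is not the Hystrix or Resilience4j API: the class, the exception name, the consecutive-failure threshold, and the single trial call after the timeout are assumptions chosen to show the closed, open, and half-open states.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without reaching the downstream service."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open if the half-open trial failed.
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None  # a success closes the circuit again
        return result


breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0)
print(breaker.call(lambda: "inventory service responded"))
```

Failing fast while the circuit is open protects the caller's threads and gives the struggling service time to recover, which complements retries and the bulkhead pattern rather than replacing them.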

 

Optimize Performance of a REST API Handling 1M+ Requests Daily
Problem Statement: You are working on a Spring Boot REST API that receives over 1 million requests daily. The API is slow and consumes high CPU and memory. How do you optimize it? (Code snippet: Spring Boot optimization, implementing Redis caching.)
