Experience: 4-7 years
Location: Pune (India)
Primary Skills: Apache Spark (Java / Python / Scala), Apache Flink, Hive, Impala
What You’ll Do
- Design, build, and optimize distributed data processing systems on Cloudera Data Platform (CDP).
- Architect batch and stream data pipelines using Apache Spark.
- Build streaming pipelines leveraging Flink, Hive and modern table formats like Iceberg.
- Develop high-performance data pipelines using Spark (Java/Python/Scala) on YARN-based clusters.
- Ensure data quality, reliability, and performance tuning across large-scale distributed systems.
- Develop and maintain ETL/ELT workflows orchestrated via Airflow.
Data Quality & Reliability
- Define and enforce data quality checks, lineage tracking, and SLA monitoring across pipelines.
- Implement unit, integration, and end-to-end testing strategies for data pipelines.
- Troubleshoot performance bottlenecks in Spark jobs, Flink topologies, and Hive queries – applying techniques such as partition pruning, broadcast joins, and predicate pushdown.
Collaboration & Governance
- Partner with data architects, data scientists, and platform engineers to translate business requirements into robust data solutions.
- Participate in design reviews, technical documentation, and knowledge sharing within the team.
- Contribute to establishing engineering standards, coding guidelines, and best practices for the data engineering discipline.
- Provide technical leadership across teams, unblock complex projects, and mentor junior engineers.
- Translate product intent into technical plans, influence roadmaps with data-driven insights, and communicate trade-offs to executives and stakeholders.
Tech Stack
Frameworks: Apache Spark (Java / Python / Scala), Apache Flink
Query Engines: Hive, Impala
Storage & Formats: Apache Iceberg
Orchestration: Apache Airflow
Infrastructure: YARN-based clusters, CDP
What We’re Looking For
- 4-7 years of proven experience building distributed data systems with Apache Spark at scale.
- Strong proficiency in Python / Java / Scala for data engineering.
- Hands-on experience with streaming frameworks (Flink) and batch orchestration (Airflow).
- Deep understanding of data quality practices, SLA monitoring, and pipeline observability.
- Experience with modern table formats (Apache Iceberg preferred).
- Strong communication skills – ability to present trade-offs clearly to technical and non-technical stakeholders.