Experience: 4-7 years

Location: Pune (India)

Primary Skills: Apache Spark (Java / Python / Scala), Apache Flink, Hive, Impala

What You’ll Do

  • Design, build, and optimize distributed data processing systems on Cloudera Data Platform (CDP).
  • Architect batch and streaming data pipelines using Apache Spark.
  • Build streaming pipelines leveraging Flink, Hive, and modern table formats such as Iceberg.
  • Develop high-performance data pipelines using Spark (Java/Python/Scala) on YARN-based clusters.
  • Ensure data quality, reliability, and performance tuning across large-scale distributed systems.
  • Develop and maintain ETL/ELT workflows orchestrated via Airflow.

Data Quality & Reliability

  • Define and enforce data quality checks, lineage tracking, and SLA monitoring across pipelines.
  • Implement unit, integration, and end-to-end testing strategies for data pipelines.
  • Troubleshoot performance bottlenecks in Spark jobs, Flink topologies, and Hive queries – applying techniques such as partition pruning, broadcast joins, and predicate pushdown.

Collaboration & Governance

  • Partner with data architects, data scientists, and platform engineers to translate business requirements into robust data solutions.
  • Participate in design reviews, technical documentation, and knowledge sharing within the team.
  • Contribute to establishing engineering standards, coding guidelines, and best practices for the data engineering discipline.
  • Provide technical leadership across teams, unblock complex projects, and mentor junior engineers.
  • Translate product intent into technical plans, influence roadmaps with data-driven insights, and communicate trade-offs to executives and stakeholders.

Tech Stack

Frameworks: Apache Spark (Java / Python / Scala), Apache Flink
Query Engines: Hive, Impala
Storage & Formats: Apache Iceberg
Orchestration: Apache Airflow
Infrastructure: YARN-based clusters, CDP

What We’re Looking For

  • 4-7 years of proven experience building distributed data systems with Apache Spark at scale.
  • Strong proficiency in Python / Java / Scala for data engineering.
  • Hands-on experience with streaming frameworks (Flink) and batch orchestration (Airflow).
  • Deep understanding of data quality practices, SLA monitoring, and pipeline observability.
  • Experience with modern table formats (Apache Iceberg preferred).
  • Strong communication skills – ability to present trade-offs clearly to technical and non-technical stakeholders.

Join our team and be part of meaningful innovation.