
Data Engineer (PySpark & Airflow)

Technology
VirtusLab
Kraków, Poland · 21,000–27,000 PLN / year · Posted 2 months ago · Apply by 23.04.2026
Full-time · Hybrid

Job description

We foster a dynamic culture rooted in strong engineering, a sense of ownership, and transparency, empowering our team. As part of the expanding VirtusLab Group, we offer a compelling environment for those seeking to make a substantial impact in the software industry within a forward-thinking organization.

About the role

Join our team to develop heavy data pipelines in cooperation with data scientists and other engineers. You will work with distributed data processing tools such as Spark to parallelise computation for Machine Learning and data pipelines.

Diagnosing and resolving technical issues, ensuring availability of high-quality solutions that can be adapted and reused. Collaborating closely with different engineering and data science teams, providing advice and technical guidance to streamline daily work. Championing best practices in code quality, security, and scalability by leading by example.

Taking your own, informed decisions that move the business forward.

Required skills
Senior: Python
Regular: PySpark
Regular: Airflow
Regular: Docker
Regular: Kubernetes
Regular: xgboost
Regular: Pandas
Regular: Scikit-learn
Regular: NumPy
Regular: GitHub Actions
Regular: Azure DevOps
Regular: Git @ GitHub

Project
STORE OPS

Project scope
The project aims at constructing, scaling, and maintaining data pipelines for a simulation platform. You will work on a solution that provides connectivity between AWS S3 and Cloudian S3. A previously completed Proof of Concept used Airflow to spin up a Spark job for data extraction and then exposed the collected data via Airflow's built-in XComs feature.
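The Proof of Concept described above (Airflow spinning up a Spark extraction job and exposing the result via XComs, against S3-compatible storage) can be sketched in plain Python. This is an illustrative sketch only, not the project's actual code: it runs without Airflow or Spark installed, the endpoint URL and task names are invented, and a plain dict stands in for Airflow's XCom store.

```python
# Illustrative sketch of the PoC data flow, with no Airflow/Spark dependency.
# A plain dict stands in for Airflow's XCom store (normally the metadata DB),
# and the endpoint URL below is a placeholder, not a real deployment address.

xcom_store: dict[str, object] = {}

def s3_client_kwargs(target: str) -> dict:
    """Keyword arguments for an S3-compatible client (e.g. boto3.client).

    Cloudian exposes an S3-compatible API, so the same client code can talk
    to AWS S3 and Cloudian S3 -- typically only the endpoint URL differs.
    """
    if target == "aws":
        return {"service_name": "s3"}  # boto3 resolves the AWS endpoint itself
    if target == "cloudian":
        return {"service_name": "s3",
                "endpoint_url": "https://cloudian.internal.example"}  # placeholder
    raise ValueError(f"unknown target: {target}")

def extract_task() -> None:
    # Stand-in for the Spark job that extracts data from object storage.
    rows = [{"store_id": 1, "sales": 100}, {"store_id": 2, "sales": 250}]
    xcom_store["extract.rows"] = rows  # like ti.xcom_push(key="rows", value=rows)

def expose_task() -> list:
    # Downstream task pulls the extracted data, like ti.xcom_pull(...).
    return xcom_store["extract.rows"]

extract_task()
exposed = expose_task()
print(len(exposed))  # 2
```

Worth noting: Airflow's XComs are intended for small payloads, so a common productionization step is to write large extracts back to object storage and pass only a reference (bucket/key) through XCom, which matches the further work the posting describes.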

Further work requires productionizing the PoC solution and testing it at scale, or proposing an alternative solution. As a Data Engineer in Store Ops, you will dive into projects that streamline retail operations through analytics and ML, applying your Python, Spark, Airflow, and Kubernetes skills.

Tech stack
Python, PySpark, Airflow, Docker, Kubernetes, Dask, xgboost, pandas, scikit-learn, numpy, GitHub Actions, Azure DevOps, Terraform, Git @ GitHub

Responsibilities
Developing heavy data pipelines in cooperation with data scientists and other engineers.
Working with distributed data processing tools such as Spark to parallelise computation for Machine Learning and data pipelines.
Diagnosing and resolving technical issues, ensuring availability of high-quality solutions that can be adapted and reused.
Collaborating closely with different engineering and data science teams, providing advice and technical guidance to streamline daily work.
Championing best practices in code quality, security, and scalability by leading by example.
Taking your own, informed decisions that move the business forward.

Challenges
Enhancing the monitoring, reliability, and stability of deployed solutions, including the development of automated testing suites.
Productionizing the new data pipeline responsible for exposing data on demand, and improving its performance in production.
Collaborating with cross-functional teams to enhance customer experiences through innovative technologies.

Team
5 engineers

What we expect in general:
Hands-on experience with Python.
Proven experience with PySpark.
Proven experience with data manipulation libraries (Pandas, NumPy, and Scikit-learn).
Regular-level experience with Apache Airflow.
Strong background in ETL/ELT design.
Regular-level proficiency in Docker and Kubernetes to containerize and scale simulation platform components.
Ability to occasionally visit the Kraków office.

Seems like lots of expectations, huh? Don't worry!

You don't have to meet all the requirements. What matters most is your passion and willingness to develop. Apply and find out!

A few perks of being with us
Building tech community
Flexible hybrid work model
Home office reimbursement
Language lessons
MyBenefit points
Private healthcare
Training package
Virtusity / in-house training
And a lot more!

Keywords
English, Python, PySpark, Airflow, Docker, Kubernetes, scikit-learn, Pandas, NumPy, GitHub Actions, Azure DevOps, OCaml, Apache Spark, XGBoost, Scalability, Apache Airflow, Dask, DevOps, Apache License

Interested in this position?