Data Infrastructure Engineer, San Francisco
As a platform company powering businesses all over the world, Stripe processes payments, runs marketplaces, detects fraud, and helps entrepreneurs start an internet business from anywhere in the world. Stripe’s Data Infrastructure Engineers build the platform, and build and improve the data pipelines that manage that data, for both internal and external users.
- Work with teams to build and continue to evolve data models and data flows to enable data-driven decision-making
- Design alerting and testing to ensure the accuracy and timeliness of these pipelines (e.g., improve instrumentation, optimize logging)
- Create user-friendly libraries that make distributed batch computation easy to write and test for all users across Stripe
- Identify shared data needs across Stripe, understand their specific requirements, and build efficient, scalable data pipelines that meet those needs and enable data-driven decisions
You might be a fit for this role if you:
- Have a strong engineering background and are interested in data. You’ll be writing production Scala and Python code.
- Have experience developing and maintaining distributed systems built with open source tools.
- Have experience optimizing the end-to-end performance of distributed systems.
- Have experience writing and debugging ETL jobs using a distributed data framework (Spark, Hadoop MapReduce, etc.)
- Have experience managing and designing data pipelines
- Can follow the flow of data through various pipelines to debug data issues
- Have experience with Spark or Scalding
- Have experience with Airflow or other similar scheduling tools
It’s not expected that you’ll have deep expertise in every dimension above, but you should be interested in learning any of the areas that are less familiar.
Some things you might work on:
- Write a unified user data model that gives a complete view of our users across a varied set of products like Stripe Connect and Stripe Atlas
- Continue to lower the latency and bridge the gap between our production systems and our data warehouse by rethinking and optimizing our core data pipeline jobs
- Create libraries that enable engineers at Stripe to easily interact with various serialization frameworks (e.g., Thrift, BSON, Protobuf)
- Pair with user teams to optimize and rewrite business-critical batch processing jobs in Spark
- Create robust and easy-to-use unit testing infrastructure for batch processing pipelines
- Build a framework and tools to re-architect data pipelines to run more incrementally
- Build the data pipeline that tracks our response time to users and our total support ticket volume, to inform staffing decisions on our support teams
We look forward to hearing from you.