· 3 min read
Matthaus Krzykowski

The number of Python developers increased from 7 million in 2017 to 15.7 million in Q1 2021 and grew by 3 million (20%) between Q4 2021 and Q1 2022 alone, making it the most popular programming language in Q3 2022. A large percentage of this new group are what we call Python practitioners: data folks and scripters. This group uses Python to do tasks in their jobs, but they do not consider themselves to be software engineers.

They are entering modern organizations en masse. Organizations often employ them for data-related jobs, especially in data engineering, data science / ML, and analytics. They must work with established data sources, data stores, and data pipelines that are essential to the business of these organizations. These companies, though, are not providing them with the type of tooling they have learned to expect. There’s no “Jupyter Notebook, pandas, NumPy, etc. for data loading” for them to use.

At this stage of dlt we are focused on serving the needs of organizations with 150 employees or fewer. Companies of this size are typically just beginning to make their first data hires. They want data to be at their core: their CEOs may want to make their companies more “data driven” and “user feedback centric”. Their CTOs may want to “build a data warehouse for automation and self service”. They are frequently eager to take advantage of the skills of the Python practitioners they have hired.

To achieve our mission of making this next generation of Python users autonomous in these organizations, we believe we need to build dlt in a “Pythonic” way. Anyone who can write a loop in a Python script should be able to write a source and load it. There should be a minimal learning curve. Anyone in these organizations who understands basic Python should be able to use dlt right away.

However, we also recognize that to fulfill our mission, dlt needs to be loved not only by Python users but also by data engineers. This is crucial because eventually these folks will be brought in to help with data loading in an organization. We need data engineers to evolve dlt pipelines rather than ripping them out and replacing them, as they almost always do with scripts written by Python practitioners today.

To develop with dlt, anyone can install it like any other Python library with pip install dlt. They can then run dlt init and be ready to go. Data engineers already love dlt's automatic schema inference and evolution, as well as its customizability.
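
To make "automatic schema inference and evolution" a bit more concrete, here is a minimal sketch in plain Python. It is not dlt's implementation and does not use dlt's API. The users_source generator and the infer_schema helper are hypothetical names invented for illustration. The sketch only shows the idea: a source can be an ordinary loop that yields records, and a schema can be derived from those records and grow when a new field shows up.

```python
from typing import Any, Dict, Iterable, Iterator


def users_source() -> Iterator[Dict[str, Any]]:
    """Hypothetical source: an ordinary loop that yields records."""
    rows = [
        {"id": 1, "name": "Ada"},
        {"id": 2, "name": "Grace", "signup_year": 2021},  # a new field appears
    ]
    for row in rows:
        yield row


def infer_schema(records: Iterable[Dict[str, Any]]) -> Dict[str, str]:
    """Map each column to a type name, adding columns as new fields appear."""
    schema: Dict[str, str] = {}
    for record in records:
        for column, value in record.items():
            # Simplifying assumption: keep the first type seen for a column.
            # A real loader would also have to reconcile conflicting types.
            schema.setdefault(column, type(value).__name__)
    return schema


schema = infer_schema(users_source())
print(schema)  # {'id': 'int', 'name': 'str', 'signup_year': 'int'}
```

The second record introduces signup_year, and the inferred schema simply gains a column instead of failing. A hand-written loading script usually breaks at exactly this point.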

· 3 min read
Matthaus Krzykowski

dltHub Mission

Since 2017, the number of Python users has been increasing by millions annually. The vast majority of these people use Python as a tool to solve problems at work. Our mission is to make this next generation of Python users autonomous when they create and use data in their organizations. To this end, we are building an open source Python library called data load tool (dlt).

These Python practitioners, as we call them, use dlt in their scripts to turn messy, unstructured data into regularly updated datasets. dlt empowers them to create highly scalable, easy to maintain, straightforward to deploy data pipelines without having to wait for help from a data engineer. When organizations eventually bring in data engineers to help with data loading, these engineers build on their work and evolve dlt pipelines.

We are dedicated to keeping dlt an open source project surrounded by a vibrant, engaged community. To make this sustainable, dltHub stewards dlt while also offering additional software and services that generate revenue (similar to what GitHub does with Git).

Why does dltHub exist?

We believe in a world where data loading becomes a commodity. A world where hundreds of thousands of pipelines will be created, shared, and deployed. A world where data sets, reports, and analytics will be written and shared publicly and privately.

To achieve our mission to make this next generation of Python users autonomous when they create and use data in their organizations, we need to address the requirements of both the Python practitioner and the data engineer with a minimal Python library. We also need dltHub to become the GitHub for data pipelines, facilitating and supporting the ecosystem of pipeline creators and maintainers as well as the other data folks who consume and analyze the data loaded.

There are lots of ETL/ELT tools available (300+!). Yet, as we engaged with Python practitioners over the last one and a half years, we found few who use traditional data ingestion tools. Only a handful have even heard of them. Put very simply, there are two approaches in traditional data ingestion tools, and neither works for this new generation: 1) SaaS solutions that handle the entire data loading process and 2) object-oriented frameworks for software engineers.

SaaS solutions do not give Python practitioners enough credit, while frameworks expect too much of them. In other words, there's no “Jupyter Notebook, pandas, NumPy, etc. for data loading” that meets users' needs. As millions of Python practitioners are now entering organizations every year, we think this should exist.

This demo works on Codespaces. Codespaces is a development environment available for free to anyone with a GitHub account. You'll be asked to fork the demo repository, and from there the README guides you through the remaining steps.
The demo uses the Continue VS Code extension.

Off to Codespaces!
