Skip to main content

Open-Source Software Podcast Episodes

View All Tags

Lars Kamp
Julia Schottenstein

dbt Labs' mission is to empower data practitioners to create and disseminate organizational knowledge with its open-source product dbt. dbt helps write and execute data transformation jobs by compiling code to SQL and running it against your cloud warehouse.

When raw data from production or SaaS apps arrives in a cloud warehouse for analysis, it's not in a usable state. Analytics engineers need to prepare, clean, join, and transform the data to match business needs. These needs could include visualizing data for a sales forecast, feeding data into a machine learning model, or preparing operational analytics with infrastructure data. The analytics engineering workflow covers all the steps from raw data extraction to data modeling and end uses like reporting or data science.

Today, over 16,000 companies use dbt. dbt has become a foundational technology for the analytics engineering workflow, which is very similar to the DevOps workflow. dbt applies software engineering principles to working with data. To "productionize" data, engineers develop, test, and integrate it—and then also provide observability and alerting once it's in production. All of this functionality is included in dbt Cloud, the commercial version of dbt.

Julia Schottenstein heads Product at dbt Labs. In this episode, Julia walks us through the evolution of dbt from a tool for data teams at start-ups to enterprise deployments where sometimes thousands of analytics engineers collaborate through dbt. We cover all aspects of the modern data stack—cloud warehouses, ETL, data pipelines, and orchestration—with an outlook on the wider use of data in the enterprise by both humans and applications:

  • dbt's semantic layer, which assigns definitions (e.g., revenue, customer, churn) to a specific metric

    The semantic layer in dbt contains the definitions for each metric, ensuring consistency and flexibility—users can slice and dice a metric along any dimension. Metrics are computed at the time of a query rather than pointing to an already materialized view.

  • Continuous integration and deployment (CI/CD) for data

    Building data pipelines is expensive, and data transformation can take a long time with large data sets and complex queries. dbt Cloud ships a purpose-built CI tool that builds the absolute minimum set of code and data to test changes.

  • How dbt works, with its direct acyclic graph (DAG)

    The DAG is a visual representation of data models and the connections in-between them. dbt started out with SQL to run all transformations, but is now also inviting other languages such as Python.

Lars Kamp
Patrick DeVivo

Software engineering is often more art than science, making it difficult to measure productivity. There are ways to use data to be more effective as an individual contributor or an engineering leader, but surprisingly, engineering organizations and teams typically are not data-driven.

MergeStat is on a mission to change this with open-source, operational analytics for software engineering organizations. MergeStat started as an experiment to bring together two technologies: SQL and Git repositories. MergeStat provides data integration for your Git repositories, facilitating the exploration of legacy code and identification of code that hadn't been touched in a while and maybe deserved new attention.

From there, the use cases evolved. Today, MergeStat is used by organizations that have hundreds or even thousands of repositories. MergeStat is data infrastructure for Git repositories, where anyone can query the history and contents of their code bases.

Behind the scenes, MergeStat syncs data from the tools used to build and ship software into a PostgreSQL instance, as APIs provided by these tools are not always easy to understand and extract data from. MergeStat puts a lot of the usual work into implementing good API data consumption, like pagination and respecting rate limits.

From there, a user can query their data directly in MergeStat, or use other business intelligence tools and dashboards that know how to speak to PostgreSQL. See this example Grafana dashboard for GitHub pull requests.

Patrick DeVivo is Founder and CEO at MergeStat. In this session, we start out with a general overview of MergeStat and how it's used today.

Patrick explains how MergeStat is a general-purpose engine that companies use to craft the queries that fit their organization. We go into a few MergeStat use cases that Patrick sees today:

  • In some cases, the actual data collection is the use case. For example, with audits the action is to deliver the list of pull requests that didn't follow best practices.
  • Understanding the different versions of a programming language in use. If you're a Go shop, a single query aggregates the different Go versions used across all repositories.
  • Find pull requests that have been open for a long time or merged without review.

Patrick's advice is to use MergeStat in a way that is positive and constructive to take action. Watch this episode to learn more about data integration for the software development lifecycle.

Lars Kamp
Waldemar Hummer

Waldemar Hummer is Co-Founder and CTO at LocalStack. LocalStack gives you a fully functional local cloud stack so you can develop and test your cloud and serverless apps offline. LocalStack is an open-source project that started at Atlassian, where its initial purpose was to keep developers productive on their daily commutes despite poor internet connectivity.

LocalStack emulates AWS cloud services on your laptop, increasing the number of phases in your infrastructure environment to four: local, test, staging, and production—with LocalStack efficiently covering the local and test phases (including CI builds). LocalStack also integrates with a large set of other cloud tools, such as Terraform, Pulumi, and CDK.

While the commute problem went mostly away with COVID, it became clear that a local development environment has speed, quality, and cost advantages. Local provisioning of resources is faster and can speed up dev feedback cycles by an order of magnitude. Developers can start their work without IAM enforcement, then later introduce security policies and migrate to the cloud. A local environment also reduces the cost of cloud sandbox accounts.

A key requirement for LocalStack to be valuable is parity with cloud provider services, which means replicating services and API responses. LocalStack is built in Python, and Waldemar walks us through LocalStack's process of building out the platform to have 99% parity with AWS.

In this episode, we also cover developer marketing, community building, and how LocalStack amassed over 44,000 stars on GitHub. Waldemar takes us through both a live LocalStack demo and a deep-dive into LocalStack's GitHub repository.

Lars Kamp
Tobi Knaup

Tobi Knaup is the CEO and a co-founder at D2iQ, an enterprise Kubernetes platform. D2iQ combines the best open-source technology from the cloud-native technology landscape into a single Kubernetes solution. Customers can deploy this solution without worrying about the individual pieces they would otherwise need to assemble, maintain, and update.

In this episode, we discuss the shift to cloud-native infrastructure and how we are now seeing a new class of smart cloud-native applications emerge. Smart cloud-native applications include artificial intelligence (AI) components that leverage data from production applications. These new applications enable entirely new use cases in every industry. Examples are autonomous driving in automotive, medical imaging in healthcare, and fraud detection in banking or crypto trading.

To build smart cloud-native applications, companies need to build the infrastructure to train their AI models, put them into production, and build differentiated products. It is an entirely new type of workload, with very dynamic and elastic demand for compute and storage.

It turns out that Kubernetes, with its scheduling and orchestration capabilities, is a perfect fit to support workloads from smart cloud-native applications. Training models requires spinning up large amounts of compute to process data and then scaling back down. By putting model predictions into production, companies can lean on existing code pipeline workflows and monitoring.

This also means that instead of running two separate types of infrastructure, companies can consolidate and run their smart cloud-native applications on the same platforms as their production applications, which generate the data in the first place. The outcome for companies is highly differentiated digital products.

Listen to this episode to learn about Kubernetes, cloud-native architecture, and changes in organization and workflows that technology leaders need to adapt to deliver smart cloud-native applications.

Lars Kamp
Avi Press

Historically, the distribution and usage of open-source projects have been challenging to measure. 📏

Understanding your user base is critical to product planning and development. However, open-source maintainers have resorted to inelegant tactics to gather data on their users—such as scraping GitHub user data, performing reverse IP lookups on website traffic, or simply hoping that users submit support requests.

Scarf aims to solve this problem with its Scarf Gateway and Documentation Insights. The Scarf Gateway provides distribution analytics for open-source software and helps maintainers connect with commercial users. Documentation Insights aid in understanding how users interact with project websites and documentation.

In this episode, Lars chats with Avi Press, Founder and CEO at Scarf. Avi shares how Scarf grew from a hobby project into a venture-funded startup, as well as his thoughts on the future of open-source business models.

Contact Us

Have feedback or need help? Don’t be shy—we’d love to hear from you!

 

 

 

Some Engineering Inc.