The Apache Software Foundation (ASF) announced Apache Gobblin as a Top-Level Project (TLP). Apache Gobblin is a distributed Big Data integration framework used in both streaming and batch data ecosystems. The project originated at LinkedIn in 2014, was open-sourced in 2015, and entered the Apache Incubator in February 2017. Apache Gobblin is used to integrate hundreds of terabytes of data and thousands of datasets per day by simplifying ingestion, replication, organization, and lifecycle management across numerous execution environments, data velocities, scales, connectors, and more.
As a scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems, Apache Gobblin makes the task of creating and maintaining a modern data lake easy. It supports the three main capabilities required by every data team:
- Ingestion of data into the data lake from a variety of sources, and export out of the lake to a variety of sinks, while supporting simple transformations.
- Data Organization within the lake (e.g. compaction, partitioning, deduplication).
- Lifecycle and Compliance Management of data within the lake (e.g. data retention, fine-grained data deletions) driven by metadata.
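To make the ingestion capability concrete: a Gobblin job is typically described by a properties file that wires together a source, optional converters, a writer, and a publisher. The sketch below follows the style of Gobblin's documented job configurations; the specific class names and property values are illustrative assumptions and may differ across Gobblin versions.

```properties
# Hypothetical Gobblin job definition (a sketch, not a verbatim example)
job.name=ExampleIngestionJob
job.group=Examples

# Source: where records are pulled from (class name is illustrative)
source.class=org.apache.gobblin.example.wikipedia.WikipediaSource

# Converters: optional simple transformations applied in-flight
converter.classes=org.apache.gobblin.example.wikipedia.WikipediaConverter

# Writer: how and where records are written in the lake
writer.builder.class=org.apache.gobblin.writer.AvroDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=AVRO

# Publisher: atomically moves completed output into its final location
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
```

This source → converter → writer → publisher pipeline is the unit that Gobblin schedules and scales across its supported execution environments.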
Apache Gobblin software is released under the Apache License v2.0 and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project’s day-to-day operations, including community development and product releases.