Git is a version control tool for tracking and managing changes to source code. It is designed primarily for software development, where the code is the main artifact. Data science work depends on both code and data, and Git is not an efficient solution there because it is not optimized for large data files. This creates a need for data versioning tools, and lakeFS is one such solution.
lakeFS is a data version control tool that lets teams manage data like code, using Git-like operations such as branching, committing, and merging, and helps achieve reproducible, high-quality data pipelines. lakeFS stores data in object stores and supports storage services such as AWS S3, Azure Blob Storage, and Google Cloud Storage, so data lake operations become repeatable. lakeFS is also format agnostic: it works with structured data, unstructured data, and open table formats.
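One way to picture the Git-like addressing is as a path convention: an object is identified by a repository, a ref (a branch name or commit ID), and a path. The sketch below assumes the `lakefs://<repository>/<ref>/<path>` URI form and the convention that an S3-compatible client pointed at a lakeFS endpoint uses the repository as the bucket and puts the ref first in the key. The helper names (`parse_lakefs_uri`, `to_s3_gateway_location`) and the example repository are hypothetical, and this is plain string handling rather than a call into any lakeFS library.

```python
from typing import NamedTuple, Tuple


class LakeFSPath(NamedTuple):
    repository: str
    ref: str   # branch name or commit ID
    path: str  # object path inside the repository


def parse_lakefs_uri(uri: str) -> LakeFSPath:
    """Split a lakefs://repository/ref/path URI into its three parts."""
    prefix = "lakefs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a lakefs URI: {uri!r}")
    parts = uri[len(prefix):].split("/", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"URI needs a repository and a ref: {uri!r}")
    path = parts[2] if len(parts) == 3 else ""
    return LakeFSPath(parts[0], parts[1], path)


def to_s3_gateway_location(p: LakeFSPath) -> Tuple[str, str]:
    """Return (bucket, key) as an S3-compatible client would address the object."""
    return p.repository, f"{p.ref}/{p.path}"


# Hypothetical repository and object, for illustration only.
on_main = parse_lakefs_uri("lakefs://example-repo/main/datasets/users.parquet")
on_branch = on_main._replace(ref="experiment")

print(to_s3_gateway_location(on_main))
print(to_s3_gateway_location(on_branch))
```

Switching the ref is the only change needed to read the same logical path from a different branch, which is what makes the model feel like Git.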
lakeFS's architecture consists of several logical services. Its server is stateless, so more instances can be added easily to handle additional load. For metadata, lakeFS relies on key-value storage (with support for databases like PostgreSQL and DynamoDB), which lets it manage data versions in a scalable manner.
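To see why keeping metadata in a key-value store lets the server stay stateless, consider the toy model below. A plain dict stands in for the database, and the key layout (`commits/...`, `branches/...`) is an assumption made for illustration, not lakeFS's actual schema. The point it sketches is that a lookup reads everything it needs from the store on each call, so any server instance could answer it.

```python
from typing import Dict, Optional

# A plain dict stands in for the key-value store (PostgreSQL, DynamoDB, ...).
KVStore = Dict[str, dict]


def put_commit(kv: KVStore, commit_id: str, parent: Optional[str],
               snapshot: Dict[str, str]) -> None:
    """Record a commit: its parent and a map of logical path -> physical address."""
    kv[f"commits/{commit_id}"] = {"parent": parent, "snapshot": dict(snapshot)}


def set_branch(kv: KVStore, branch: str, commit_id: str) -> None:
    """A branch is only a named pointer to a commit."""
    kv[f"branches/{branch}"] = {"commit": commit_id}


def resolve(kv: KVStore, ref: str, path: str) -> str:
    """Stateless lookup: nothing is cached between calls."""
    branch = kv.get(f"branches/{ref}")
    commit_id = branch["commit"] if branch else ref  # ref may already be a commit ID
    return kv[f"commits/{commit_id}"]["snapshot"][path]


# Hypothetical bucket and object names.
kv: KVStore = {}
put_commit(kv, "c1", None, {"data/users.csv": "s3://example-bucket/obj-001"})
put_commit(kv, "c2", "c1", {"data/users.csv": "s3://example-bucket/obj-002"})
set_branch(kv, "main", "c2")

print(resolve(kv, "main", "data/users.csv"))  # current version on main
print(resolve(kv, "c1", "data/users.csv"))    # the older version stays addressable
```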
Advantages of using lakeFS
- lakeFS allows users to roll back to previous commits in case of bad data.
- lakeFS is useful across data science, data engineering, and data operations workflows.
- lakeFS facilitates collaboration among developers.
- It supports robust data pre-processing: steps such as outlier handling and filling in missing values can be run on an isolated branch and merged only once the result has been checked.
- It makes CI/CD pipelines for data easier to implement by automating data checks and validations, which can be triggered by data operations such as commits and merges.
- Developers can run different experiments in parallel on separate branches and select the best-performing model.
- lakeFS branches allow the creation of test environments, which reportedly reduces testing time by as much as 80%.
- lakeFS helps reduce storage costs: each developer gets an isolated branch of the data lake without copying the underlying objects.
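Several of these advantages (rollback, isolated branches, no extra storage for a new branch) follow from one idea: a branch is a pointer to a commit, and a commit is a snapshot that maps paths to stored objects. The sketch below is a toy model of that idea under simplifying assumptions. `ToyRepo` and its methods are hypothetical and are not the lakeFS API, and rollback is modeled as simply moving the branch pointer back to the parent commit.

```python
from typing import Dict, List, Optional


class ToyRepo:
    """Toy model of branch, commit, and rollback. Not the lakeFS API."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}          # physical storage: address -> content
        self.commits: List[Dict[str, str]] = [{}]    # commit i: path -> address
        self.parents: List[Optional[int]] = [None]   # commit i -> parent commit
        self.branches: Dict[str, int] = {"main": 0}  # branch name -> commit index

    def create_branch(self, name: str, source: str) -> None:
        # Zero-copy: only a pointer is written, no object is duplicated.
        self.branches[name] = self.branches[source]

    def commit(self, branch: str, changes: Dict[str, bytes]) -> int:
        parent = self.branches[branch]
        snapshot = dict(self.commits[parent])
        for path, content in changes.items():
            address = f"obj-{len(self.objects):04d}"
            self.objects[address] = content
            snapshot[path] = address
        self.commits.append(snapshot)
        self.parents.append(parent)
        self.branches[branch] = len(self.commits) - 1
        return self.branches[branch]

    def rollback(self, branch: str) -> None:
        # Point the branch back at the parent of its current commit.
        parent = self.parents[self.branches[branch]]
        if parent is None:
            raise ValueError("branch has no earlier commit")
        self.branches[branch] = parent

    def read(self, branch: str, path: str) -> bytes:
        return self.objects[self.commits[self.branches[branch]][path]]


good = b"id,name\n1,Ada\n"
repo = ToyRepo()
repo.commit("main", {"users.csv": good})

repo.create_branch("experiment", source="main")
stored_after_branching = len(repo.objects)  # branching added no objects

repo.commit("experiment", {"users.csv": b"id,name\n1,\n"})  # isolated change
main_during_experiment = repo.read("main", "users.csv")

repo.commit("main", {"users.csv": b"corrupted"})  # bad data lands on main
repo.rollback("main")
main_after_rollback = repo.read("main", "users.csv")
```

In this model the experiment never touches what `main` sees, and undoing the bad commit only changes a pointer, which is why both operations are cheap regardless of data size.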
Limitations of lakeFS
- Deleting data can be difficult: users have reported problems removing commits, as well as a lack of data deduplication, with the same file stored more than once under different opaque object names.
- lakeFS also carries some complexity, as several of its features require technical expertise to use.
- Some users are also unclear about the value of pre-commit and pre-merge hooks for their data pipelines.
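To make the purpose of such hooks more concrete, here is a sketch of the kind of check a pre-merge hook might run: inspect the data on a branch and block the merge if required values are missing. This is plain Python rather than lakeFS's hook mechanism. The function names, the `required` columns, and the sample rows are all hypothetical.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def check_rows(rows: List[Row], required: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the data passed."""
    problems = []
    for i, row in enumerate(rows):
        for column in required:
            if row.get(column) in (None, ""):
                problems.append(f"row {i}: missing value for '{column}'")
    return problems


def try_merge(rows: List[Row], required: List[str]) -> bool:
    """Hypothetical gate: allow the merge only when the check finds no problems."""
    problems = check_rows(rows, required)
    for problem in problems:
        print("blocked:", problem)
    return not problems


good_rows = [{"id": "1", "email": "ada@example.com"},
             {"id": "2", "email": "alan@example.com"}]
bad_rows = [{"id": "1", "email": "ada@example.com"},
            {"id": "2", "email": None}]

merged_good = try_merge(good_rows, ["id", "email"])
merged_bad = try_merge(bad_rows, ["id", "email"])
```

The value of running this before a merge rather than after is that bad data never reaches the branch that downstream consumers read from.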
In conclusion, lakeFS is a data version control tool that makes it easier to manage changes in data and to build reproducible, high-quality data pipelines. It is built around Git-like operations and is useful in data science, data engineering, and data operations workflows. Some users have reported limitations, such as difficulty removing commits and the technical understanding that some features require. Nevertheless, the tool is under active development and is expected to improve over time.