lakeFS | Data version control for your data lake | Cloud Storage library
kandi X-RAY | lakeFS Summary
lakeFS is an open source tool that transforms your object storage into a Git-like repository. It enables you to manage your data lake the way you manage your code. With lakeFS you can build repeatable, atomic and versioned data lake operations - from complex ETL jobs to data science and analytics. lakeFS supports AWS S3, Azure Blob Storage and Google Cloud Storage as its underlying storage service. It is API compatible with S3, and works seamlessly with all modern data frameworks such as Spark, Hive, AWS Athena, Presto, etc. For more information see the official lakeFS documentation.
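Because lakeFS is S3 API compatible, any S3 client can read versioned data through its gateway. Below is a minimal, hypothetical sketch using Python's boto3 against a local lakeFS instance; the endpoint, credentials, repository name (example-repo), branch (main) and object key are all placeholders for your own setup.

import boto3

# Point an ordinary S3 client at the lakeFS S3 gateway (placeholder endpoint
# and credentials -- substitute your own lakeFS installation's values).
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8000",
    aws_access_key_id="LAKEFS_ACCESS_KEY_ID",
    aws_secret_access_key="LAKEFS_SECRET_ACCESS_KEY",
)

# lakeFS exposes each repository as a bucket; the first path segment of the
# key selects the branch (or any other ref), so this reads from 'main'.
obj = s3.get_object(Bucket="example-repo", Key="main/collections/file.csv")
print(obj["Body"].read())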
lakeFS Examples and Code Snippets
@Controller
public class SwaggerController {
    private final JsonSerializer jsonSerializer;
    private final SwaggerResourcesProvider swaggerResources;

    @Autowired
    // Truncated in the source; the constructor is completed here from the declared fields.
    public SwaggerController(JsonSerializer jsonSerializer,
                             SwaggerResourcesProvider swaggerResources) {
        this.jsonSerializer = jsonSerializer;
        this.swaggerResources = swaggerResources;
    }
}
Community Discussions
Trending Discussions on lakeFS
QUESTION
Do I need a garbage collector in lakeFS when I delete an object from a branch via the API (using the appropriate method, of course)? Do I understand correctly that the garbage collector is only needed for objects that are deleted by a commit, and that those objects are soft-deleted (by the commit)? And if I use the delete API method, is the object hard-deleted, so that I don't need to invoke the garbage collector?
...ANSWER
Answered 2021-Dec-14 at 07:58
lakeFS manages versions of your data, so deletions only affect successive versions. The object itself remains and can be accessed through an older version.
Garbage collection removes the underlying files. Once a file is gone, its key is still visible in older versions, but if you try to access the file itself you will receive HTTP status code 410 Gone.
For full information, please see the Garbage collection docs.
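For illustration, a hedged sketch of the delete path discussed above, calling the lakeFS REST API's delete-object endpoint directly with Python's requests; the server address, credentials, repository, branch and object path are placeholders:

import requests

LAKEFS_API = "http://localhost:8000/api/v1"           # placeholder server
AUTH = ("LAKEFS_ACCESS_KEY_ID", "LAKEFS_SECRET_KEY")  # placeholder credentials

# Remove 'collections/file.csv' from the head of the 'main' branch.
# As explained above, this only affects subsequent versions; the underlying
# file stays in the object store until garbage collection removes it.
resp = requests.delete(
    f"{LAKEFS_API}/repositories/example-repo/branches/main/objects",
    params={"path": "collections/file.csv"},
    auth=AUTH,
)
resp.raise_for_status()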
QUESTION
{
  "Id": "Policy1590051531320",
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1590051522178",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts",
        "s3:GetBucketVersioning",
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListBucketMultipartUploads",
        "s3:ListBucketVersions"
      ],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::lakefs", "arn:aws:s3:::lakefs/backend.txt/*"],
      "Principal": {"AWS": ["arn:aws:iam::REDACTED:user/uing"]}
    }
  ]
}
...ANSWER
Answered 2021-Oct-12 at 02:52
You can't have those spaces before the { at the beginning. It should be:
QUESTION
How do I find and hard-delete objects older than n days in lakeFS? Later it'll be a scheduled job.
...ANSWER
Answered 2021-Nov-28 at 10:34
To do that you should use the Garbage Collection (GC) feature in lakeFS.
Note: this feature cleans objects from the underlying storage only after they are deleted from your branches in lakeFS.
You will need to:
Define GC rules to set your desired retention period.
From the lakeFS UI, go to the repository you would like to hard-delete objects from -> Settings -> Retention, and define the GC rule for each branch under the repository. For example -
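A plausible rules document in the JSON format described by the lakeFS garbage collection docs; the branch names and retention periods below are illustrative only:

{
  "default_retention_days": 21,
  "branches": [
    {"branch_id": "main", "retention_days": 28},
    {"branch_id": "dev", "retention_days": 7}
  ]
}

With rules like these, an object deleted from dev becomes eligible for hard deletion after 7 days, while deletions on any other branch fall back to the 21-day default.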
QUESTION
I'm reading the documentation about lakeFS and right now I don't clearly understand what a merge, or even a merge conflict, is in terms of lakeFS.
Let's say I use Apache Hudi for ACID support over a single table. I'd like to introduce multi-table ACID support, and for this purpose I would like to use lakeFS together with Hudi.
If I understand everything correctly, lakeFS is a data-agnostic solution and knows nothing about the data itself. lakeFS only establishes boundaries (version control) and somehow moderates concurrent access to the data.
So the reasonable question is: if lakeFS is data agnostic, how does it support the merge operation? What does a merge itself mean in terms of lakeFS? And is it possible to have a merge conflict there?
...ANSWER
Answered 2021-Oct-04 at 16:59
You do understand everything correctly. You can see on the branching model page that lakeFS is currently data agnostic and relies simply on the hierarchical directory structure. A conflict occurs when two branches update the same file. This behavior fits most data engineers' CI/CD use cases.
If you are working with Delta Lake and have made changes to the same table from two different branches, there will still be a conflict, because both branches changed the log file. To resolve the conflict you would need to forgo one of the change sets. Admittedly this is not the best user experience, and it is currently being worked on. You can read more about it in the roadmap documentation.
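To make this concrete, here is a hedged sketch of requesting a merge through the lakeFS REST API with Python's requests and detecting a conflict; the server address, credentials, repository and branch names are placeholders:

import requests

LAKEFS_API = "http://localhost:8000/api/v1"           # placeholder server
AUTH = ("LAKEFS_ACCESS_KEY_ID", "LAKEFS_SECRET_KEY")  # placeholder credentials

# Ask lakeFS to merge the 'experiment' branch into 'main'.
resp = requests.post(
    f"{LAKEFS_API}/repositories/example-repo/refs/experiment/merge/main",
    auth=AUTH,
    json={"message": "merge experiment into main"},
)
if resp.status_code == 409:
    # Both branches changed the same path since diverging -- a merge conflict.
    print("conflict:", resp.text)
else:
    resp.raise_for_status()
    print("merged:", resp.json())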
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install lakeFS
Ensure you have Docker & Docker Compose installed on your computer.
Run the following command: curl https://compose.lakefs.io | docker-compose -f - up
Open http://127.0.0.1:8000/setup in your web browser to set up an initial admin user, used to log in and send API requests.
On Windows, ensure you have Docker installed.
Run the following command in PowerShell: Invoke-WebRequest https://compose.lakefs.io | Select-Object -ExpandProperty Content | docker-compose -f - up
Open http://127.0.0.1:8000/setup in your web browser to set up an initial admin user, used to log in and send API requests.
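Once the admin user exists, you can exercise the API. Below is a hypothetical first step, creating a repository against the quickstart instance with Python's requests; the credentials come from the setup page, and the repository name and storage namespace are placeholders (the Docker quickstart uses the local:// scheme; use s3://... on AWS, and so on):

import requests

LAKEFS_API = "http://127.0.0.1:8000/api/v1"
AUTH = ("LAKEFS_ACCESS_KEY_ID", "LAKEFS_SECRET_KEY")  # from the setup page

# Create a first repository on the quickstart's local block adapter.
resp = requests.post(
    f"{LAKEFS_API}/repositories",
    auth=AUTH,
    json={
        "name": "example-repo",
        "storage_namespace": "local://example-repo",
        "default_branch": "main",
    },
)
resp.raise_for_status()
print(resp.json())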