sqlbits | well tested functions that assist in building SQL statements | SQL Database library
kandi X-RAY | sqlbits Summary
An assortment of powerful & well tested functions that assist in building SQL statements
Community Discussions
Trending Discussions on sqlbits
QUESTION
The Data Lake approach (according to slide 5 here) is:
- Ingest all data - regardless of requirements
- Store all data - in native format without schema definition
- Do analysis - using engines like Hadoop
But suppose we have loaded many datasets into our data lake: how do I go about schema discovery in an automated and scalable manner? Does U-SQL support dynamic schema discovery, or what would be a good way to go about it using ADLA or another toolset?
ANSWER
Answered 2017-Aug-17 at 23:57
This is a good question but the answer somewhat depends on the schema you want to discover.
Let me explain:
If you have CSV-type data, there are tools, including the latest version of the ADL Tools for Visual Studio, that will try to detect your schema from the provided data (the tools will actually generate the EXTRACT statement for you).
Some interactive languages may also give you extractors that try to infer the schema as part of the query. We do not support this in U-SQL at the moment, because you do not want to have a batch job infer the schema wrongly and fail after spending possibly a lot of money to run the job. In an interactive setting, it is less costly and can be easily corrected/overwritten by the query author.
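As a rough illustration of the offline detection described above, here is a minimal Python sketch that samples a CSV file and emits a U-SQL EXTRACT statement for the author to review before any batch job runs. It is not how the ADL Tools work internally; the function names (`infer_column_type`, `build_extract_statement`), the `@rows` rowset name, and the file path are hypothetical, and it assumes the first row is a header whose names are already valid U-SQL identifiers.

```python
import csv
import io


def _parses_as(converter, value):
    """Return True if converter(value) succeeds."""
    try:
        converter(value)
        return True
    except ValueError:
        return False


def infer_column_type(values):
    """Pick the narrowest U-SQL (C#) type that fits every non-empty sample value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    if all(_parses_as(int, v) for v in non_empty):
        base = "long"
    elif all(_parses_as(float, v) for v in non_empty):
        base = "double"
    else:
        return "string"  # string is already nullable in U-SQL
    # Empty cells in a numeric column suggest a nullable type.
    return base + "?" if len(non_empty) < len(values) else base


def build_extract_statement(sample_text, path, rowset="@rows"):
    """Generate an EXTRACT statement from a CSV sample (hypothetical helper)."""
    reader = csv.reader(io.StringIO(sample_text))
    header = next(reader)
    rows = list(reader)
    columns = []
    for index, name in enumerate(header):
        values = [row[index] for row in rows if index < len(row)]
        columns.append(f"{name} {infer_column_type(values)}")
    column_list = ",\n            ".join(columns)
    return (
        f"{rowset} =\n"
        f"    EXTRACT {column_list}\n"
        f'    FROM "{path}"\n'
        f"    USING Extractors.Csv(skipFirstNRows: 1);"
    )


sample = "id,name,score\n1,alice,3.5\n2,bob,\n"
statement = build_extract_statement(sample, "/raw/sales/scores.csv")
print(statement)
```

Because the inference only sees a sample, the generated types are a starting point: a later row may not fit, which is exactly the failure mode that makes inference at query time risky in a batch job.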
If, however, you have data such as images or text documents, or even nested, semistructured documents like JSON or XML, the schema you want often has to be provided. E.g., if you have a JPEG file, do you want the EXIF properties? If so, which ones? Or some feature extraction? Or some color analysis? etc.
So I think one important thing when designing a data lake is to organize the native-format data into semantically meaningful folder structures, and then either use Views/TVFs to provide the schematized view(s) in the metadata service so the data is more easily discoverable, or use a service like Azure Data Catalog to describe the data.
If you already have data inside the lake's storage and want to discover it, right now you would have to build some form of discovery with U-SQL and the SDKs, or with some tooling that goes against the WebHDFS APIs of the store.
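One possible shape for such tooling, sketched in Python: build a WebHDFS LISTSTATUS request for a folder and summarize the JSON response into subfolders to walk next and a count of files per extension. The helper names and the account URL are hypothetical, authentication (an OAuth bearer token for the store) and the HTTP call itself are omitted, and the sample response below is made up to show the standard `FileStatuses`/`FileStatus` layout.

```python
import json
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import quote


def liststatus_url(base_url, path):
    """Build a WebHDFS LISTSTATUS URL for a directory path."""
    return f"{base_url.rstrip('/')}/webhdfs/v1/{quote(path.strip('/'))}?op=LISTSTATUS"


def summarize_listing(response_text):
    """Return (subdirectory names, file count per extension) for one listing."""
    statuses = json.loads(response_text)["FileStatuses"]["FileStatus"]
    directories = [s["pathSuffix"] for s in statuses if s["type"] == "DIRECTORY"]
    extensions = Counter(
        PurePosixPath(s["pathSuffix"]).suffix.lower() or "(none)"
        for s in statuses
        if s["type"] == "FILE"
    )
    return directories, dict(extensions)


# Hypothetical account URL; a real call would also send an Authorization header.
url = liststatus_url("https://example.azuredatalakestore.net/", "/raw/sales")

# Made-up response in the WebHDFS LISTSTATUS format.
sample_response = json.dumps({
    "FileStatuses": {
        "FileStatus": [
            {"pathSuffix": "2017", "type": "DIRECTORY"},
            {"pathSuffix": "a.csv", "type": "FILE"},
            {"pathSuffix": "b.CSV", "type": "FILE"},
            {"pathSuffix": "c.json", "type": "FILE"},
        ]
    }
})
directories, extensions = summarize_listing(sample_response)
```

Walking the returned subdirectories recursively would give an inventory of file types per folder, which can then be matched against the folder organization and described in Views/TVFs or a catalog.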
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported