etl | Extract Transform Load (ETL) data processing library | Data Migration library
kandi X-RAY | etl Summary
Flow is an advanced and flexible PHP data processing library designed around the Pipes & Filters architecture. Beyond typical ETL use cases (Extract, Transform, Load), Flow can also be used for memory-safe data analysis. By default, all operations are synchronous, but for larger datasets Flow also offers an asynchronous pipeline.
Top functions reviewed by kandi - BETA
- Checks if two values are equal.
- Checks if two arrays are equal.
- Compares two arrays.
- Creates a new exception.
- Throws an exception.
- Skips rows.
etl Key Features
etl Examples and Code Snippets
Community Discussions
Trending Discussions on etl
QUESTION
I am not sure if what I am trying to do is possible, but I am having a hard time with the compiler when trying to mock a method that contains a templated reference parameter.
The interface (removed all irrelevant methods)
...ANSWER
Answered 2022-Mar-29 at 22:01
Well, this is strange, but simply using
fixes your problem.
QUESTION
My Lambda function triggers a Glue job via boto3's glue.start_job_run,
and here is my Glue job script:
...ANSWER
Answered 2022-Mar-20 at 13:58
You can't define schema types using toDF(). With the toDF() method, we don't have control over schema customization. Having said that, with the createDataFrame() method we have complete control over the schema customization.
See below logic -
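As a rough illustration of the difference (not the answer's original snippet; the column names and data below are made up), a minimal PySpark sketch:

```python
# Hypothetical data and column names, purely to contrast the two approaches.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("schema-demo").getOrCreate()
rows = [("alice", 30), ("bob", 25)]
rdd = spark.sparkContext.parallelize(rows)

# toDF(): you can name the columns, but the types are inferred for you.
df_inferred = rdd.toDF(["name", "age"])

# createDataFrame(): names, types and nullability are fully under your control.
schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=True),
])
df_typed = spark.createDataFrame(rdd, schema=schema)
df_typed.printSchema()
```

With toDF() the age column would come out as an inferred long, whereas the explicit schema pins it to an integer and records nullability.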
QUESTION
I have a requirement to build an SSIS package that sends HTML-formatted emails and then saves the emails as TIFF files. I have created a script task that processes the necessary records and then converts the HTML code to TIFF. I have split the process into separate packages; the email send works fine, but converting HTML to TIFF is causing the issue.
When running the package manually, it will process all files without any issues. My test set is currently about 315 files; this needs to be able to process at least 1,000 when finished, with the ability to send up to 10,000 at one time. The problem is that when I set the package to execute using SQL Server Agent, it stops at 207 files. The package is deployed to SQL Server 2019 in the SSIS Catalog.
What I have tried so far
I started with the script placed in an SSIS package deployed to the server, calling the package from a job step (which works 99.999999% of the time with all packages), and tried both the 32- and 64-bit runtimes. There are never any error messages, just "Unexpected Termination" when looking at the execution reports. When clicking in the catalog and executing the package, it will process all the files. The SQL Server Agent is using a proxy, and I also created another proxy account with my admin credentials to test for any issues with the account.
I created another package and used the Execute Package Task to call the first package: same result, 207 files. I changed the Execute Process Task to an Execute SQL Task and tried the script that is generated to manually start a package in the catalog: 207 files. I tried executing the script from the command line, both through the other SSIS package and through SQL Server Agent directly: same result, 207 files. If I try any of those methods directly, outside SQL Server Agent, the process runs with no issues.
I converted the script task to a console application and it works, processing all the files. When calling the executable from any method under SQL Server Agent, it once again stops at 207 files.
I have consulted with the company's DBA and Systems teams and they have not found anything that could be causing this error. There seems to be some type of limit that SQL Server Agent will not allow me to exceed, no matter the method of execution. I have mentioned looking at third-party applications but have been told no.
I have included the code below that I have been able to piece together. I am a SQL developer, so C# is outside my knowledge base. Is there a way to optimize the code so it only uses one thread, or does a cleanup between each letter? There may be a need for this to create over ten thousand letters at certain times.
Update
I have replaced the code with the new, updated code. The email and image creation are all included, as this is what the final product must do. When sending the emails there is a primary and a secondary email address, and depending on which email address is used, the body of the email changes. Looking at the code, there is a try/catch section that sends to the primary address when indicated to, and if that fails it sends to the secondary instead. I am guessing there is a much cleaner way of doing that section, but this is my first program, as I work in SQL for everything else.
Thank You for all the suggestions and help.
Updated Code
...ANSWER
Answered 2022-Mar-07 at 16:58
I have resolved the issue so it meets the needs of my project. There is probably a better solution, but this does work. Using the code above, I created an executable file and limited the result set to the top 100. I then created an SSIS package with a For Loop that does a record count from the staging table and kicks off the executable file. I performed several tests and was able to exceed the 10,000 limit that was a requirement of the project.
QUESTION
(My first Stack Overflow question)
My goal is to improve the ETL process for identifying which NetApp file shares are related to which AD permission distribution groups. Currently, an application named 'TreeSize' scans a number of volumes and outputs a number of large .CSV files (35 MB+). I want to merge this data and remove all permission information where the group (or named user) doesn't start with a capital G or D ('^[GD]'). With over 700,000 rows to process, it's currently taking me over 24 hours to run. I hope there is a better way to process this data more efficiently, to drastically cut that time down.
Here is test data which resembles the actual data once all files have been merged. Use rownum to adjust the size of the data (the real data has 700,000+ rows).
Test Data
...ANSWER
Answered 2022-Feb-15 at 22:57
I think the key to speeding this up is to avoid looping over each row, when it can be done in a single vectorised operation for the strsplit and final paste operations.
QUESTION
We have recently been parachuted into a new ETL project with very bad code. I have in my hands a query with 700 rows and all sorts of updates.
I would like to debug it with SET XACT_ABORT ON;
and the goal is to roll back everything if even one statement fails.
But I found several ways to achieve it on Stack Overflow, like this:
...ANSWER
Answered 2022-Feb-01 at 08:14
It is not the same. It decides when errors are thrown.
You should always use SET XACT_ABORT ON, because it is more consistent; almost always, an error will stop execution and abort the transaction. Otherwise, some errors stop execution while others let the rest of the batch keep running.
There is a great article about this whole subject on Erland Sommarskog's site, and if you go to that point you will see a table which describes this strange behaviour. In conclusion, I recommend the General Pattern for Error Handling, as it is very well documented and gives you the opportunity to tweak it according to its own documentation.
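For illustration only, here is a minimal Python/pyodbc sketch of sending such a batch with SET XACT_ABORT ON; the connection string and table names are hypothetical, and the substance is the T-SQL inside the string.

```python
# Hypothetical connection string and table names; the point is the T-SQL batch.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=etl-server;DATABASE=dw;Trusted_Connection=yes;",
    autocommit=True,  # let the T-SQL batch manage its own transaction
)

batch = """
SET XACT_ABORT ON;  -- any run-time error aborts the batch and rolls back the open transaction
BEGIN TRAN;
    UPDATE dbo.staging SET status = 'ready' WHERE batch_id = 42;
    UPDATE dbo.audit_log SET processed_at = SYSUTCDATETIME() WHERE batch_id = 42;
    -- ...the remaining updates of the long query would go here...
COMMIT TRAN;
"""

try:
    conn.cursor().execute(batch)
except pyodbc.Error as err:
    # With XACT_ABORT ON, the whole transaction has already been rolled back by the time we get here.
    print("Batch failed and was rolled back:", err)
```

Without SET XACT_ABORT ON, some errors (such as constraint violations) terminate only the failing statement, and the rest of the batch continues and commits.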
QUESTION
I am working on Glue in AWS and trying to test and debug in local dev. I followed the instructions here https://aws.amazon.com/blogs/big-data/developing-aws-glue-etl-jobs-locally-using-a-container/ to develop Glue jobs locally. In that post they use the Glue 1.0 image for testing, and it works as it should. However, when I try to develop with the Glue 3.0 version, I follow the guidance steps but I can't open the Jupyter notebook on :8888 as the post describes, even though every step seems correct.
Here is my command to start a Jupyter notebook on the Glue 3.0 container:
...ANSWER
Answered 2022-Jan-16 at 11:25
It seems that the Glue 3.0 image has some issues with SSL. A workaround for working locally is to disable SSL (you also have to change the script paths, as the documentation is not updated).
QUESTION
I am working on a large Java web-based project that is based on some older versions of Spring and Hibernate, and uses log4j 1.2.x. Due to recent vulnerabilities found in log4j2, we have been directed to upgrade to the latest version of log4j2. I am trying to implement the log4j2 log4j1 bridge, so that I don't have to update all our logging code in the application. It is all working fine, except I cannot specify where to store the log files, because the log4j1 bridge appears to not support system properties. I am passing in a ${catalina.base} property when starting my Tomcat server, but the log4j1 bridge uses the literal text rather than substituting the property value.
My maven pom.xml
...ANSWER
Answered 2022-Jan-11 at 20:03
Unless I am mistaken, this is a bug in the support of Log4j 1.x XML configurations. Support for Log4j 1.x variable substitutions was introduced in this commit (it works only if you use org.apache.log4j.config.Log4j1ConfigurationFactory, which is not the default) for the properties format, but an equivalent change for the XML format is missing. You should report it.
In the meantime you can use ${sys:catalina.base} as a workaround (basically the Log4j 1.x bridge supports Log4j 2.x lookups, instead of simple system property substitution).
Edit: The Log4j 1.x bridge has three configuration factories:
- Log4j1ConfigurationFactory supports only *.properties files and has been using property substitution since the commit mentioned above (2016),
- support for property substitution in the PropertiesConfigurationFactory was added in LOG4J2-2951, as mentioned by Paul in the comments,
- lack of support for property substitution in the XmlConfigurationFactory has been reported by Paul as LOG4J2-3328.
QUESTION
I'm trying to get a "file not found" message when nothing matches the file_name*.txt pattern in a specific directory. When calling the script with a file_name*.txt argument, all works fine, while entering the invalid name file_*.txt throws:
ANSWER
Answered 2022-Jan-08 at 13:13
The issue is the following line:
QUESTION
On the ETL server I have a DW user table.
On the prod OLTP server I have the sales database. I want to pull the sales only for users that are present in the user table on the ETL server.
Presently, I am using an Execute SQL Task to fetch the DW users into an SSIS System.Object variable, then a Foreach Loop to iterate through each item (userid) in this variable and, via a Data Flow Task, fetch the OLTP sales rows for each user and dump them into the DW staging table. The Foreach Loop is taking a long time to run.
I want to be able to do an inner join so that the response is quicker, but I can't do this since they are on separate servers. Neither can I use a global temp table to make the inner join, for the same reason.
I tried collecting the DW users into a comma-separated string variable and then using it (via string_split) to query the OLTP server, but this is also taking more time at the pre-execute phase (not sure why exactly), even for a small number of users.
I am also aware of the Lookup transform, but that too will result in all OLTP rows being brought into the DW ETL server to test the lookup condition.
Is there any alternate approach to be able to do an inner join by taking the list of users into the source?
Note: I do not have write permissions on the OLTP db.
...ANSWER
Answered 2022-Jan-01 at 18:24
Based on the comments, I think we can use a temporary table to solve this.
Can you help me understand this restriction? "Neither can I use a global temp table to make the inner join, for the same reason."
The restriction is that, since the OLTP server and DW server are separate, we can't have a global temp table common to both servers. Hope that makes sense.
The general pattern we're going to use is as follows (see the sketch after this list):
- Execute SQL Task to create a temporary table on the OLTP server
- A Data Flow task to populate the new temporary table. Source = DW. Destination = OLTP. Ensure Delay Validation = True
- Modify the existing Data Flow. Change the source to a query that uses the temporary table, i.e.
SELECT S.* FROM oltp.sales AS S WHERE EXISTS (SELECT * FROM #SalesPerson AS SP WHERE SP.UserId = S.UserId);
Ensure Delay Validation = True
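Outside of SSIS, the same idea can be sketched in plain Python with pyodbc; the connection strings and the DW table name below are hypothetical, and in the actual package these steps are performed by the Execute SQL Task and Data Flow tasks above.

```python
# Rough, non-SSIS illustration of the temp-table pattern, with hypothetical names.
import pyodbc

dw = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=dw-server;DATABASE=DW;Trusted_Connection=yes;")
oltp = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=oltp-server;DATABASE=Sales;Trusted_Connection=yes;", autocommit=True)

# Step 1: create the temporary table on the OLTP server (it lives for this session/connection).
oltp.execute("CREATE TABLE #SalesPerson (UserId INT NOT NULL PRIMARY KEY);")

# Step 2: push the DW user list over to the OLTP server.
user_ids = [row.UserId for row in dw.execute("SELECT UserId FROM dbo.DwUsers;")]
oltp.cursor().executemany("INSERT INTO #SalesPerson (UserId) VALUES (?);", [(uid,) for uid in user_ids])

# Step 3: the join now happens on the OLTP side, so only the matching sales rows come back.
sales = oltp.execute(
    "SELECT S.* FROM oltp.sales AS S "
    "WHERE EXISTS (SELECT * FROM #SalesPerson AS SP WHERE SP.UserId = S.UserId);"
).fetchall()

oltp.execute("DROP TABLE #SalesPerson;")
```

The point of the design is the same in both versions: the small DW user list travels to the OLTP server once, so the filtering join runs there instead of looping over users row by row.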
A long-form answer on using temporary tables (global to set the metadata, regular thereafter): I don't use temp table in SSIS.
Temporary tables live in tempdb. Your OLTP and DW connection managers likely do not point to tempdb. To be able to reference a temporary table, local or global, in SSIS you need to either define an additional connection manager for the same server that points explicitly at tempdb, so you can use the drop-down in the source/destination components (technically accurate but dumb), or use an SSIS Variable to hold the name of the table and use the "From Variable" named option in the source/destination component (best option, maximum flexibility).
Soup to nuts example
I will use WideWorldImporters as my OLTP system and WideWorldImportersDW as my DW system.
One-time task
Open SQL Server Management Studio (SSMS) and connect to your OLTP system. Define a global temporary table with a unique name and the expected structure. Leave your connection open so the table structure remains intact during initial development.
I used the following statement.
QUESTION
My colleagues and I routinely create ad hoc scripts in R to perform ETL on proprietary data and generate automated reports for clients. I am attempting to standardize our approach, for the sake of consistency, modularity, and reusability.
In particular, I want to consolidate our most commonly used functions in a central directory, and to access them as if they were functions from a proprietary R package. However, I am quite raw as an R developer, and my teammates are even less experienced in R development. As such, the development of a formal package is unfeasible for the moment.
Approach
Fortunately, the box package, by Stack Overflow's very own Konrad Rudolph, provides (among other modularity) an accessible approach to approximate the behavior of an R package. Unlike the rigorous development process outlined by the RStudio team, box requires only that one create a regular .R file, in a meaningful location, with roxygen2 documentation (#') and explicit @exports:
Writing modules
The module bio/seq, which we have used in the previous section, is implemented in the file bio/seq.r. The file seq.r is, by and large, a normal R source file, which happens to live in a directory named bio. In fact, there are only three things worth mentioning:
Documentation. Functions in the module file can be documented using 'roxygen2' syntax. It works the same as for packages. The 'box' package parses the documentation and makes it available via box::help. Displaying module help requires that 'roxygen2' is installed.
Export declarations. Similar to packages, modules explicitly need to declare which names they export; they do this using the annotation comment #' @export in front of the name. Again, this works similarly to 'roxygen2' (but does not require having that package installed).
⋮
At the moment, I am tinkering around with a particular module, as "imported" into a script. While the "import" itself works seamlessly, I cannot seem to access the documentation for my functions.
Code
I am experimenting with box on a Lenovo ThinkPad running Windows 10 Enterprise. I have created a script, aptly titled Script.R, whose location serves as my working directory. My module exists in the relative subdirectory ./Resources/Modules as the humble file time.R, reproduced here:
ANSWER
Answered 2021-Jul-30 at 23:42
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install etl
PHP requires the Visual C runtime (CRT). The Microsoft Visual C++ Redistributable for Visual Studio 2019 is suitable for all these PHP versions, see visualstudio.microsoft.com. You MUST download the x86 CRT for PHP x86 builds and the x64 CRT for PHP x64 builds. The CRT installer supports the /quiet and /norestart command-line switches, so you can also script it.