Explore all CSV Processing open source software, libraries, packages, source code, cloud functions and APIs.

Popular New Releases in CSV Processing

Laravel-Excel

v3.1.33

PapaParse

5.3.0

q

Next Release Development Build

miller

Restore --tsvlite; add gssub and expand dhms functions

visidata

v2.8: Python 3.10 compatibility

Popular Libraries in CSV Processing

Laravel-Excel

by Maatwebsite · PHP

10159 stars · MIT license

🚀 Supercharged Excel exports and imports in Laravel

PapaParse

by mholt · JavaScript

9757 stars · MIT license

Fast and powerful CSV (delimited text) parser that gracefully handles large files and malformed input

q

by harelba · Python

8893 stars · GPL-3.0 license

q - Run SQL directly on delimited files and multi-file sqlite databases

xsv

by BurntSushi · Rust

7540 stars · license not specified

A fast CSV command line toolkit written in Rust.

countries

by mledoze · PHP

5413 stars · ODbL-1.0 license

World countries in JSON, CSV, XML and Yaml. Any help is welcome!

miller

by johnkerl · Go

5203 stars · license not specified

Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON

visidata

by saulpw · Python

5081 stars · GPL-3.0 license

A terminal spreadsheet multitool for discovering and arranging data

csvkit

by wireservice · Python

4887 stars · MIT license

A suite of utilities for converting to and working with CSV, the king of tabular file formats.

tablib

by jazzband · Python

4102 stars · MIT license

Python Module for Tabular Datasets in XLS, CSV, JSON, YAML, &c.

Trending New libraries in CSV Processing

tv

by alexhallam · Rust

1302 stars · Unlicense

📺(tv) Tidy Viewer is a cross-platform CLI csv pretty printer that uses column styling to maximize viewer enjoyment.

MiniExcel

by shps951023 · C#

748 stars · Apache-2.0 license

Fast, Low-Memory, Easy Excel .NET helper to import/export/template spreadsheet

image2csv

by artperrin · Python

668 stars · MIT license

Convert tables stored as images to a usable .csv file

flat-ui

by githubocto · TypeScript

305 stars · MIT license

github-artifact-exporter

by github · TypeScript

253 stars · MIT license

A set of packages to make exporting artifacts from GitHub easier

dplyr-cli

by coolbutuseless · R

237 stars · MIT license

Manipulate CSV files on the command line using dplyr

react-papaparse

by Bunlong · TypeScript

228 stars · MIT license

react-papaparse is the fastest in-browser CSV (or delimited text) parser for React. It is full of useful features such as CSVReader, CSVDownloader, readString, jsonToCSV, readRemoteFile, ... etc.

tresor-import

by tresorone · JavaScript

192 stars · AGPL-3.0 license

Extract transactions from PDF statements of brokers/banks or "Portfolio Performance" CSV exports. Compatible with Tresor One activities

csv2

by p-ranav · C++

187 stars · MIT license

Fast CSV parser and writer for Modern C++

Top Authors in CSV Processing

1. maxogden (9 libraries, 550 stars)

2. bodastage (9 libraries, 32 stars)

3. frictionlessdata (8 libraries, 859 stars)

4. csvreader (6 libraries, 219 stars)

5. theodi (5 libraries, 279 stars)

6. faradayio (5 libraries, 68 stars)

7. scottrobertson (5 libraries, 27 stars)

8. shawnbot (4 libraries, 67 stars)

9. FlatFilers (4 libraries, 418 stars)

10. medialab (4 libraries, 43 stars)


Trending Kits in CSV Processing

A JSON array is an ordered list of values; it can store multiple strings, numbers, booleans, or objects, with the values separated by commas. A CSV (comma-separated values) file is a plain text file that stores data column by column, with the columns split by commas.


Now let's look at the procedure for converting a JSON array to CSV:

  • Read the data from the JSON file and store the result as a string.
  • Construct a JSON object from that string.
  • Get the JSON array from the JSON object.
  • Create a new CSV file using java.io.File.
  • Produce comma-delimited text from the JSONArray of JSONObjects and write it to the newly created CSV file.
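The steps above can be sketched briefly in Python for illustration (the kit itself targets Java; here the standard json and csv modules play the roles of the JSON object classes and java.io.File, and the sample data and the file name output.csv are made-up examples):

```python
import csv
import json

# A JSON array of objects, stored as a string (step 1).
json_text = '[{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]'

# Parse the string and obtain the array of objects (steps 2-3).
records = json.loads(json_text)

# Create a new CSV file and write one comma-delimited row per
# JSON object, preceded by a header row (steps 4-5).
with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
```

The resulting output.csv contains a header line followed by one line per object in the array.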


JSON is a lightweight, language-independent data-interchange format. It can be parsed from a string to produce vector-like objects. Its advantages for data storage are that it is safe for transferring data and portable across platforms. In terms of application scalability, and when working with large volumes of data, JSON is generally preferred over CSV. JSON is most commonly used in JavaScript-based applications, including browser extensions and websites.


Here is an example of how you can convert JSON array to CSV in Java:

Fig 1: Preview of the output that you will get on running this code from your IDE

Code

Instructions

  1. Copy the code using the "Copy" button above, and paste it into a Java file in your IDE.
  2. Add the required dependencies and import them into the Java file.
  3. Run the file to generate the output CSV file.

I hope you found this useful. I have added links to the dependent libraries and version information in the following sections.


I found this code snippet by searching for 'json array list to csv format' in kandi. You can try any such use case!

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.

  1. The solution is created in Java 11.0.17.
  2. The solution is tested on JSON Version:20210307 and apache.commons:commons-io:1.3.2.


Using this solution, we can convert a JSON array to CSV in a few simple steps. It also provides an easy, hassle-free way to create a hands-on working version of the code.

Dependent Libraries

You can add the dependent library to your Gradle or Maven files. You can get the dependency XML from the link above.

You can search for any dependent library on kandi, like Apache Commons IO and JSON-java.

Support

  1. For any support on kandi solution kits, please use the chat
  2. For further learning resources, visit the Open Weaver Community learning page.

The technique of converting a JSON array to a CSV file using Apache Commons IO in Java can be helpful in several situations where you want to export data stored in a JSON array to a CSV file, for instance:

  • Converting data represented as a JSON array that is stored in a database or other storage system;
  • Exporting data from a web application or API that returns data in JSON format;
  • Transforming data from one format to another as part of an ETL (extract, transform, load) process.

 

Apache Commons IO offers a large selection of classes and methods for I/O-related operations, including reading and writing files, traversing directories, and reading from and writing to input and output streams. It is a widely used library and a valuable tool to have in your toolbox when working with I/O operations in Java.

Here is an example of how you can convert a JSON array to CSV using Apache Commons IO in Java for your application:


Fig 1: Preview of the output that you will get on running this code from your IDE

Code

  1. Copy the code using the "Copy" button above, and paste it into a Java file in your IDE.
  2. Add the dependent libraries and import them into the Java file.
  3. Run the file to generate the CSV file.

I hope you found this useful. I have added links to the dependent libraries and version information in the following sections.


I found this code snippet by searching for "json array to csv in java" in kandi. You can try any such use case!

Development Libraries

You can add the dependent library to your Gradle or Maven files. You can get the dependency XML from the link above.

You can search for any dependent library on kandi, like Apache Commons IO and JSON-java.

Support

  1. For any support on kandi solution kits, please use the chat
  2. For further learning resources, visit the Open Weaver Community learning page.


You can use the pandas library in Python to append data to an existing table. Appending adds new rows of data without modifying or deleting the existing data, which is helpful if you wish to update or add new data while maintaining a historical record of the old data.

  • append(): The append() method in pandas is used to add rows to a DataFrame; it returns a new DataFrame with the newly added rows, leaving the original DataFrame unchanged. The new DataFrame can be assigned to a variable for further processing or analysis. (Note: DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat() should be used instead.)

  • pd.concat(): pd.concat() is a function in the pandas library that is used to concatenate or join multiple DataFrames along a particular axis (axis=0 for rows and axis=1 for columns). 


pd.concat() is similar to the append() method, but it can be used to concatenate DataFrames along either the rows or columns axis, and it can also take a list of DataFrames as input, whereas append() can only take one DataFrame at a time and concatenate along the rows axis. 
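The comparison above can be sketched with pd.concat() (the DataFrames here are made-up example data; pd.concat() is also the portable choice, since DataFrame.append() was removed in pandas 2.0):

```python
import pandas as pd

# An existing table and the new rows to append (example data).
existing = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
new_rows = pd.DataFrame({"id": [3], "value": ["c"]})

# Concatenate along the row axis (axis=0, the default);
# ignore_index=True renumbers the combined index.
# The original DataFrames are left unchanged.
combined = pd.concat([existing, new_rows], ignore_index=True)
print(combined)
```

Because pd.concat() takes a list, any number of DataFrames can be appended in one call.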


For a better understanding of appending data to an existing table using pandas, have a look at the code below.

Fig : Preview of the output that you will get on running this code from your IDE.

Code

In this solution, we use the pandas library.

Instructions

Follow the steps carefully to get the output easily.

  1. Install pandas in your IDE (any of your favorite IDEs).
  2. Copy the snippet using the "Copy" button and paste it into your IDE.
  3. Add the required dependencies and import them into the Python file.
  4. Run the file to generate the output.


I hope you found this useful. I have added links to the dependent libraries and version information in the following sections.


I found this code snippet by searching for 'how to append data in existing table using pandas' in kandi. You can try any such use case!

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.

  1. The solution is created in PyCharm 2021.3.
  2. The solution is tested on Python 3.9.7.
  3. Pandas version-v1.5.2.


Using this solution, we can append data to an existing table using pandas in a few simple steps. It also provides an easy, hassle-free way to create a hands-on working version of the code.

Dependent Library

You can also search for any dependent libraries on kandi like 'pandas'.

Support

  1. For any support on kandi solution kits, please use the chat
  2. For further learning resources, visit the Open Weaver Community learning page.


Trending Discussions on CSV Processing

Peformance issues reading CSV files in a Java (Spring Boot) application

Inserting json column in Bigquery

Avoid repeated checks in loop

golang syscall, locked to thread

How to break up a string into a vector fast?

CSV Regex skipping first comma

QUESTION

Peformance issues reading CSV files in a Java (Spring Boot) application

Asked 2022-Jan-29 at 12:37

I am currently working on a Spring-based API which has to transform CSV data and expose it as JSON. It has to read big CSV files containing more than 500 columns and 2.5 million lines each. I am not guaranteed to have the same header between files (each file can have a completely different header than another), so I have no way to create a dedicated class which would provide a mapping for the CSV headers. Currently the API controller calls a CSV service which reads the CSV data using a BufferedReader.

The code works fine on my local machine, but it is very slow: it takes about 20 seconds to process 450 columns and 40,000 lines. To improve processing speed, I tried to implement multithreading with Callables, but I am not familiar with that kind of concept, so the implementation might be wrong.

On top of that, the API runs out of heap memory when running on the server. I know that a solution would be to increase the amount of available memory, but I suspect that the replace() and split() operations on strings made in the Callables are responsible for consuming a large amount of heap memory.

So I actually have several questions :

#1. How could I improve the speed of the CSV reading ?

#2. Is the multithread implementation with Callable correct ?

#3. How could I reduce the amount of heap memory used in the process ?

#4. Do you know of a different approach to split at commas and replace the double quotes in each CSV line? Would StringBuilder be of any help here? What about StringTokenizer?

Below is the CSV method:

public static final int NUMBER_OF_THREADS = 10;

public static List<List<String>> readCsv(InputStream inputStream) {
        List<List<String>> rowList = new ArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(NUMBER_OF_THREADS);
        List<Future<List<String>>> listOfFutures = new ArrayList<>();
        try {
                BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8));
                String line = null;
                while ((line = reader.readLine()) != null) {
                        CallableLineReader callableLineReader = new CallableLineReader(line);
                        Future<List<String>> futureCounterResult = pool.submit(callableLineReader);
                        listOfFutures.add(futureCounterResult);
                }
                reader.close();
                pool.shutdown();
        } catch (Exception e) {
                // log Error reading csv file
        }

        for (Future<List<String>> future : listOfFutures) {
                try {
                        List<String> row = future.get();
                        rowList.add(row);
                } catch (ExecutionException | InterruptedException e) {
                        // log Error CSV processing interrupted during execution
                }
        }

        return rowList;
}

And the Callable implementation

public class CallableLineReader implements Callable<List<String>> {

        private final String line;

        public CallableLineReader(String line) {
                this.line = line;
        }

        @Override
        public List<String> call() throws Exception {
                return Arrays.asList(line.replace("\"", "").split(","));
        }
}

ANSWER

Answered 2022-Jan-29 at 02:56

I don't think that splitting this work onto multiple threads is going to provide much improvement, and may in fact make the problem worse by consuming even more memory. The main problem is using too much heap memory, and the performance problem is likely to be due to excessive garbage collection when the remaining available heap is very small (but it's best to measure and profile to determine the exact cause of performance problems).

The memory consumption would be less from the replace and split operations, and more from the fact that the entire contents of the file need to be read into memory in this approach. Each line may not consume much memory, but multiplied by millions of lines, it all adds up.

If you have enough memory available on the machine to assign a heap size large enough to hold the entire contents, that will be the simplest solution, as it won't require changing the code.

Otherwise, the best way to deal with large amounts of data in a bounded amount of memory is to use a streaming approach. This means that each line of the file is processed and then passed directly to the output, without collecting all of the lines in memory in between. This will require changing the method signature to use a return type other than List. Assuming you are using Java 8 or later, the Stream API can be very helpful. You could rewrite the method like this:

public static Stream<List<String>> readCsv(InputStream inputStream) {
    BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8));
    return reader.lines().map(line -> Arrays.asList(line.replace("\"", "").split(",")));
}

Note that this throws unchecked exceptions in case of an I/O error.

This will read and transform each line of input as needed by the caller of the method, and will allow previous lines to be garbage collected if they are no longer referenced. This then requires that the caller of this method also consume the data line by line, which can be tricky when generating JSON. The JakartaEE JsonGenerator API offers one possible approach. If you need help with this part of it, please open a new question including details of how you're currently generating JSON.

Source https://stackoverflow.com/questions/70900587

QUESTION

Inserting json column in Bigquery

Asked 2021-Jun-02 at 06:55

I have a JSON that I want to insert into BQ. The column data type is STRING. Here is the sample JSON value.

"{\"a\":\"#\",\"b\":\"b value\"}"

This is a bulk load from a CSV file.

The error I'm getting is

Error: Data between close double quote (\") and field separator."; Reason: "invalid"},Invalid {Location: ""; Message: "Error while reading data, error message: CSV processing encountered too many errors, giving up. Rows: 0; errors: 1; max bad: 0; error percent: 0"; Reason: "invalid"}

Thanks!

ANSWER

Answered 2021-Jun-02 at 06:55

I think there is an issue with how you escape the double quotes. I could reproduce the issue you describe, and fixed it by escaping the double quotes with "" instead of a backslash (\"):

"{""a"":""#"",""b"":""b value""}"

This information is somewhat hidden in the documentation (in the "Quote" section):

For example, if you want to escape the default character ' " ', use ' "" '.
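If you generate the CSV from Python, note that the standard csv module doubles embedded quotes by default, so it emits exactly this escaping; a quick sketch with the row from this question:

```python
import csv
import io

# csv.writer doubles embedded double quotes by default (doublequote=True),
# which matches the "" escaping that BigQuery's CSV loader expects.
buf = io.StringIO()
csv.writer(buf).writerow(['{"a":"#","b":"b value"}'])
print(buf.getvalue())  # "{""a"":""#"",""b"":""b value""}"
```

This is a sketch; for a real bulk load you would write to a file instead of an in-memory buffer.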

Source https://stackoverflow.com/questions/67799161

QUESTION

Avoid repeated checks in loop

Asked 2021-Apr-23 at 11:51

I'm sorry if this has been asked before. It probably has, but I just have not been able to find it. On with the question:

I often have loops that are initialized with certain conditions that affect or (de)activate certain behaviors inside them, but do not drastically change the general loop logic. These conditions do not change during the loop's operation, but have to be checked every iteration anyway. Is there a way to optimize such a loop in a Pythonic way to avoid doing the same check every single time? I understand this would be a compiler's job in any compiled language, but there is no compiler here.

Now, for a specific example, imagine I have a function that parses a CSV file with a format somewhat like this, where you do not know in advance the columns that will be present on it:

COL_A,COL_B,COL_F,COL_H,COL_M,COL_T,COL_Z
1,2,3,4,5,6,7
8,9,10,11,12,13,14
...

And you have this function to manually process the file (I know there are better ways to deal with CSVs, the question is about optimizing the loop, not about CSV processing). For one reason or another, columns COL_C, COL_D, COL_M and COL_N (to name a few) need special processing. This would result in something like:

def process_csv(file):
    with open(file, 'r') as f:
        headers = f.readline().split(',')
        has_C = "COL_C" in headers
        has_D = "COL_D" in headers
        has_M = "COL_M" in headers
        has_N = "COL_N" in headers

        for line in f:
            elements = line.split(',')
            if has_C:
                ...  # Special processing related to COL_C
            if has_D:
                ...  # Special processing related to COL_D
            if has_M:
                ...  # Special processing related to COL_M
            if has_N:
                ...  # Special processing related to COL_N
            ...  # General processing, common to all iterations

As I said above, any way to factor out the checks in some way? It may not represent a noticeable impact for this example, but if you have 50 special conditions inside the loop, you end up doing 50 'unnecessary' checks for every single iteration.

--------------- EDIT ------------------

As another example of what I would be looking for, here is (very) evil code that optimizes the loop by not doing any checks in it, but instead constructing the loop itself according to the starting conditions. I suppose that for a (VERY) long loop with MANY conditions, this solution may eventually be faster. This depends on how exec is handled, though, which I am not sure about, since I find exec something to avoid...

def new_process_csv(file):
    with open(file, 'r') as f:
        headers = f.readline().split(',')
        code = \
f'''
for line in f:
    elements = line.split(',')
    {... if "COL_C" in headers else ''}
    {... if "COL_D" in headers else ''}
    {... if "COL_M" in headers else ''}
    {... if "COL_N" in headers else ''}
    ...  # General processing
'''
        exec(code)

ANSWER

Answered 2021-Apr-23 at 11:36

Your code seems right to me, performance-wise.

You are doing your checks at the beginning of the loop:

headers = f.readline().split(',')
has_C = "COL_C" in headers
has_D = "COL_D" in headers
has_M = "COL_M" in headers
has_N = "COL_N" in headers

Inside the loop, you are not doing unnecessary checks, you are just checking the result of something you already have computed, which is super-fast and really does not need optimization. You can run a profiler on your code to convince yourself of that: https://docs.python.org/3/library/profile.html

If you are looking for readability, you may want to:

  • put the special cases in other methods.
  • store the headers in a set to check membership:
headers = f.readline().split(',')
headers_set = set(headers)

for line in f:
    elements = line.split(',')
    if "COL_C" in headers_set:
        ...  # Special processing related to COL_C
    if "COL_D" in headers_set:
        ...  # Special processing related to COL_D
    if "COL_M" in headers_set:
        ...  # Special processing related to COL_M
    if "COL_N" in headers_set:
        ...  # Special processing related to COL_N

Source https://stackoverflow.com/questions/67228959

QUESTION

golang syscall, locked to thread

Asked 2021-Apr-21 at 15:29

I am attempting to create a program to scrape XML files. I'm experimenting with Go because of its goroutines. I have several thousand files, so some kind of multiprocessing is almost a necessity...

I got a program to successfully run and convert XML to CSV (as a test, not quite the end result) on a test set of files, but when run with the full set of files, it gives this:

runtime: program exceeds 10000-thread limit

I've been looking for similar problems, and there are a couple, but I haven't found one that was similar enough to solve this.

And finally, here is some of the code I'm running:

// main func (start goroutines)

for i := range filelist {
    channels = append(channels, make(chan Test))
    go Parse(files[i], channels[len(channels)-1])
}

// Parse func (individual goroutines)

func Parse(fileName string, c chan Test) {
    defer close(c)

    doc := etree.NewDocument()
    if err := doc.ReadFromFile(fileName); err != nil {
        return
    }

    root := doc.SelectElement("trc:TestResultsCollection")

    for _, test := range root.FindElements("//trc:TestResults/tr:ResultSet/tr:TestGroup/tr:Test") {
        var outcome Test
        outcome.StepType = test.FindElement("./tr:Extension/ts:TSStepProperties/ts:StepType").Text()
        outcome.Result = test.FindElement("./tr:Outcome").Attr[0].Value
        for _, attr := range test.Attr {
            if attr.Key == "name" {
                outcome.Name = attr.Value
            }
        }

        for _, attr := range test.FindElement("./tr:TestResult/tr:TestData/c:Datum").Attr {
            if attr.Key == "value" {
                outcome.Value = attr.Value
            }
        }

        c <- outcome
    }
}

// main (process results when goroutines return)

for c := 0; c < len(channels); c++ {
    for i := range channels[c] {
        // csv processing with i
    }
}

I'm sure there's some ugly code in there. I've just picked up Go recently, coming from other languages... so I apologize in advance. Anyhow:

Any ideas?

ANSWER

Answered 2021-Apr-21 at 15:25

I apologize for not including the correct error. As the comments pointed out, I was doing something dumb and creating a goroutine for every file. Thanks to JimB for correcting me, and to torek for providing a solution and this link: https://gobyexample.com/worker-pools

jobs := make(chan string, numJobs)
results := make(chan []Test, numJobs)

for w := 0; w < numWorkers; w++ {
    go Worker(w, jobs, results)
    wg.Add(1)
}

// give workers jobs

for _, i := range files {
    if filepath.Ext(i) == ".xml" {
        jobs <- ("Path to files" + i)
    }
}

close(jobs)
wg.Wait()

// result processing <- results
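The worker-pool pattern the linked page describes (a fixed number of workers pulling jobs from a shared queue) is language-agnostic. As a rough illustration only, not the answer's Go code, here is the same shape in Python, with a hypothetical placeholder `parse` standing in for the per-file XML work:

```python
import concurrent.futures

def parse(path):
    # Placeholder for the real per-file XML-to-rows work.
    return [path.upper()]

def run_pool(paths, num_workers=4):
    # A fixed-size pool bounds concurrency no matter how many files are
    # queued, which is what the one-goroutine-per-file version lacked.
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as pool:
        for rows in pool.map(parse, paths):
            results.extend(rows)
    return results

print(run_pool(["a.xml", "b.xml"]))  # ['A.XML', 'B.XML']
```

Note that `pool.map` also preserves input order, so results line up with the file list without needing a channel per file.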

Source https://stackoverflow.com/questions/67182393

QUESTION

How to break up a string into a vector fast?

Asked 2020-Jul-31 at 21:54

I am processing CSV and using the following code to process a single line.


std::vector<std::string> string_to_vector(const std::string& s, const char delimiter, const char escape) {
  std::stringstream sstr{s};
  std::vector<std::string> result;
  while (sstr.good()) {
    std::string substr;
    getline(sstr, substr, delimiter);
    while (substr.back() == escape) {
      std::string tmp;
      getline(sstr, tmp, delimiter);
      substr += "," + tmp;
    }
    result.emplace_back(substr);
  }
  return result;
}

What it does: the function breaks up string s based on delimiter. If the delimiter is escaped with escape, the delimiter is ignored.

This code works but is super slow. How can I speed it up?

Do you know any existing csv processing implementation that does exactly this and which I could use?

ANSWER

Answered 2020-Jul-31 at 21:54

The fastest way to do something is to not do it at all.

If you can ensure that your source string s will outlive the use of the returned vector, you could replace your std::vector<std::string> with std::vector<char*> which would point to the beginning of each substring. You then replace your identified delimiters with zeroes.

[EDIT] I have not moved up to C++17, so no string_view for me :)

NOTE: typical CSV is different from what you imply; it doesn't use escape for the comma, but surrounds entries with comma in it with double quotes. But I assume you know your data.
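As an aside to that note: stock CSV readers handle both conventions. A small illustration (not part of the original answer) with Python's csv module, which covers the quoted convention by default and the question's backslash convention via escapechar:

```python
import csv
from io import StringIO

# Standard quoted convention: commas inside double quotes are data.
quoted = next(csv.reader(StringIO('this,is,"a,test"')))
print(quoted)  # ['this', 'is', 'a,test']

# The question's convention: a backslash escapes the delimiter.
escaped = next(csv.reader(StringIO('this,is,a\\,test'),
                          escapechar='\\', quoting=csv.QUOTE_NONE))
print(escaped)  # ['this', 'is', 'a,test']
```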

Implementation:

#include <iostream>
#include <vector>
#include <string>

std::vector<char*> string_to_vector(std::string& s,
                                    const char delimiter, const char escape)
{
  size_t prev(0), pos(0), from(0);
  std::vector<char*> v;
  while ((pos = s.find(delimiter, from)) != s.npos)
  {
    if (pos == 0 || s[pos - 1] != escape)
    {
      s[pos] = 0;
      v.push_back(&s[prev]);
      prev = pos + 1;
    }
    from = pos + 1;
  }
  v.push_back(&s[prev]);
  return v;
}

int main() {
  std::string test("this,is,a\\,test");
  std::vector<char*> v = string_to_vector(test, ',', '\\');

  for (auto& s : v)
    std::cout << s << " ";
}

Source https://stackoverflow.com/questions/63197165

QUESTION

CSV Regex skipping first comma

Asked 2020-May-11 at 22:02

I am using regex for CSV processing, where data can be in quotes or no quotes. But if there is just a comma at the starting column, it skips it.

Here is the regex I am using: (?:,"|^")(""|[\w\W]*?)(?=",|"$)|(?:,(?!")|^(?!"))([^,]*?|)(?=$|,)

Now the example data I am using is: ,"data",moredata,"Data", which should have 4 matches: ["","data","moredata","Data"]. But it always skips the first comma. It is fine if there are quotes on the first column, or it is not blank, but if it is empty with no quotes, it ignores it.

Here is a sample code I am using for testing purposes, it is written in Dart:

void main() {
  String delimiter = ",";
  String rawRow = ',,"data",moredata,"Data"';
  RegExp exp = new RegExp(r'(?:' + delimiter + r'"|^")(^,|""|[\w\W]*?)(?="' + delimiter + r'|"$)|(?:' + delimiter + '(?!")|^(?!"))([^' + delimiter + r']*?)(?=$|' + delimiter + r')');

  Iterable<Match> matches = exp.allMatches(rawRow.replaceAll("\n", "").replaceAll("\r", "").trim());
  List<String> row = new List();
  matches.forEach((Match m) {
    // This checks to see which match group it found the item in.
    String cellValue;
    if (m.group(2) != null) {
      // Data found without speech marks
      cellValue = m.group(2);
    } else if (m.group(1) != null) {
      // Data found with speech marks (so it removes escaped quotes)
      cellValue = m.group(1).replaceAll('""', '"');
    } else {
      // Anything left
      cellValue = m.group(0).replaceAll('""', '"');
    }
    row.add(cellValue);
  });
  print(row.toString());
}

ANSWER

Answered 2020-May-11 at 22:02

Investigating your expression

(,"|^")
(""|[\w\W]*?)
(?=",|"$)
|
(,(?!")|^(?!"))
([^,]*?|)
(?=$|,)

(,"|^")(""|[\w\W]*?)(?=",|"$) This part matches quoted strings, which seems to work for you.

Going through this part: (,(?!")|^(?!"))([^,]*?|)(?=$|,)
(,(?!")|^(?!")) : a comma not followed by ", OR the start of the line not followed by "
([^,]*?|) : zero or more non-comma characters, non-greedy, plus an empty alternative (why the |?)
(?=$|,) : followed by the end of the line or a comma.

In CSV this ,,,3,4,5 line should give 6 matches but the above only gets 5
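The undercount is easy to reproduce outside Dart, since Python's re supports the same lookarounds. A quick check (illustrative, with the delimiter hard-coded to a comma) using the question's pattern:

```python
import re

# The question's pattern, delimiter fixed to a comma for this sketch.
pattern = r'(?:,"|^")(""|[\w\W]*?)(?=",|"$)|(?:,(?!")|^(?!"))([^,]*?|)(?=$|,)'

fields = [m.group(2) for m in re.finditer(pattern, ',,,3,4,5')]
print(fields)  # ['', '', '3', '4', '5'] -- five matches; the leading empty field is lost
```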

You could add (^(?=,)) at the beginning of the second part, the part that matches non-quoted sections.
Here is the second part with that start-of-line match added, and the groups made non-capturing:
(?:^(?=,))|(?:,(?!")|^(?!"))(?:[^,]*?)(?=$|,)

Complete: (?:,"|^")(?:""|[\w\W]*?)(?=",|"$)|(?:^(?=,))|(?:,(?!")|^(?!"))(?:[^,]*?)(?=$|,)

Here is another that might work
(?:(?:"(?:[^"]|"")*"|(?<=,)[^,]*(?=,))|^[^,]+|^(?=,)|[^,]+$|(?<=,)$)
How that works is described here: Build CSV parser using regex

Source https://stackoverflow.com/questions/61584722

Community Discussions contain sources that include Stack Exchange Network
