Popular New Releases in CSV Processing
Laravel-Excel
v3.1.33
PapaParse
5.3.0
q
Next Release Development Build
miller
Restore --tsvlite; add gssub and expand dhms functions
visidata
v2.8: Python 3.10 compatibility
Popular Libraries in CSV Processing
by Maatwebsite php
10159
MIT
Supercharged Excel exports and imports in Laravel
by mholt javascript
9757
MIT
Fast and powerful CSV (delimited text) parser that gracefully handles large files and malformed input
by harelba python
8893
GPL-3.0
q - Run SQL directly on delimited files and multi-file sqlite databases
by BurntSushi rust
7540
NOASSERTION
A fast CSV command line toolkit written in Rust.
by mledoze php
5413
ODbL-1.0
World countries in JSON, CSV, XML and Yaml. Any help is welcome!
by johnkerl go
5203
NOASSERTION
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
by saulpw python
5081
GPL-3.0
A terminal spreadsheet multitool for discovering and arranging data
by wireservice python
4887
MIT
A suite of utilities for converting to and working with CSV, the king of tabular file formats.
by jazzband python
4102
MIT
Python Module for Tabular Datasets in XLS, CSV, JSON, YAML, &c.
Trending New libraries in CSV Processing
by alexhallam rust
1302
Unlicense
(tv) Tidy Viewer is a cross-platform CLI csv pretty printer that uses column styling to maximize viewer enjoyment.
by shps951023 csharp
748
Apache-2.0
Fast, Low-Memory, Easy Excel .NET helper to import/export/template spreadsheet
by artperrin python
668
MIT
Convert tables stored as images to a usable .csv file
by githubocto typescript
305
MIT
by github typescript
253
MIT
A set of packages to make exporting artifacts from GitHub easier
by coolbutuseless r
237
MIT
Manipulate CSV files on the command line using dplyr
by Bunlong typescript
228
MIT
react-papaparse is the fastest in-browser CSV (or delimited text) parser for React. It is full of useful features such as CSVReader, CSVDownloader, readString, jsonToCSV, readRemoteFile, ... etc.
by tresorone javascript
192
AGPL-3.0
Extract transactions from PDF statements of brokers/banks or "Portfolio Performance" CSV exports. Compatible with Tresor One activities
by p-ranav c++
187
MIT
Fast CSV parser and writer for Modern C++
Top Authors in CSV Processing
1
9 Libraries
550
2
9 Libraries
32
3
8 Libraries
859
4
6 Libraries
219
5
5 Libraries
279
6
5 Libraries
68
7
5 Libraries
27
8
4 Libraries
67
9
4 Libraries
418
10
4 Libraries
43
Trending Kits in CSV Processing
A JSON array is an ordered list of values. It can store multiple values, such as strings, numbers, booleans, or objects, and the values must be separated by commas. A CSV (comma-separated values) file is a plain text file that stores data column by column, with the columns split by commas.
The procedure to convert a JSON array to CSV is as follows (a short illustrative sketch follows the list):
- Read the data from the JSON file and store the result as a string.
- Construct a JSON object using the above string.
- Get the JSON Array from the JSON Object.
- Create a new CSV file using java.io.File.
- Produce comma-delimited text from the JSONArray of JSONObjects and write it to the newly created CSV file.
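As a rough illustration of those steps (a minimal sketch, not the kit's own code; it assumes the org.json library is on the classpath and that input.json contains a top-level array of flat objects):

import java.io.FileWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.json.CDL;
import org.json.JSONArray;

public class JsonArrayToCsv {
    public static void main(String[] args) throws Exception {
        // Step 1: read the JSON file and store the result as a string.
        String json = new String(Files.readAllBytes(Paths.get("input.json")), StandardCharsets.UTF_8);

        // Steps 2-3: build the JSONArray from the string.
        JSONArray array = new JSONArray(json);

        // Step 5 (first half): CDL.toString() produces comma-delimited text,
        // with a header row taken from the keys of the first JSONObject.
        String csv = CDL.toString(array);

        // Steps 4-5 (second half): create the CSV file and write the text to it.
        try (FileWriter writer = new FileWriter("output.csv")) {
            writer.write(csv);
        }
    }
}

CDL (comma-delimited list) is the org.json helper class built for exactly this kind of JSONArray-to-CSV conversion.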
JSON is a lightweight, language-independent data-interchange format. Parsers can turn JSON text from a string into vector-like objects. JSON is a safe format for transferring data and works well across platforms; for storage, it is generally preferred over CSV when the application or file needs to scale or when working with a large volume of data. Its most common use is in JavaScript-based applications, including websites and browser extensions.
Here is an example of how you can convert JSON array to CSV in Java:
Fig 1: Preview of the output that you will get on running this code from your IDE
Code
Instructions
- Copy the code using the "Copy" button above, and paste it in a Java file in your IDE.
- Add the required dependencies and import them in the Java file.
- Run the file to generate the output CSV file.
I hope you found this useful. I have added links to the dependent libraries and version information in the following sections.
I found this code snippet by searching for 'json array list to csv format' in kandi. You can try any such use case!
Environment Tested
I tested this solution in the following versions. Be mindful of changes when working with other versions.
- The solution is created in Java 11.0.17.
- The solution is tested on JSON Version:20210307 and apache.commons:commons-io:1.3.2.
Using this solution, we are able to convert a JSON array to CSV in a few simple steps. It also provides an easy, hassle-free way to put together a hands-on, working version of code for converting a JSON array to CSV.
Dependent Libraries
You can add the dependent library to your Gradle or Maven files. You can get the dependency XML from the link above.
You can search for any dependent library on kandi, like Apache Commons IO and JSON-java.
Support
- For any support on kandi solution kits, please use the chat
- For further learning resources, visit the Open Weaver Community learning page.
The technique of converting a JSON array to a CSV file using Apache Commons IO in Java is helpful whenever you want to export data stored in a JSON array to a CSV file. For instance, it may be helpful when:
- converting data saved in a database or other storage system that is represented as a JSON array;
- exporting data from a web application or API that returns data in JSON format;
- transforming data from one format to another as part of an ETL (extract, transform, load) process.
Apache Commons IO offers a large selection of classes and methods for I/O-related operations, including reading and writing files, traversing directories, and reading from and writing to input and output streams. It is a widely used library, which makes it a valuable tool in your toolbox whenever you work with I/O in Java.
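As a rough sketch of how the pieces could fit together (not the kit's own code; it assumes org.json and a recent commons-io release are on the classpath, and the exact FileUtils signatures vary between commons-io versions):

import java.io.File;
import java.nio.charset.StandardCharsets;

import org.apache.commons.io.FileUtils;
import org.json.CDL;
import org.json.JSONArray;

public class JsonArrayToCsvWithCommonsIo {
    public static void main(String[] args) throws Exception {
        // FileUtils.readFileToString pulls the whole JSON document into memory as one string.
        String json = FileUtils.readFileToString(new File("input.json"), StandardCharsets.UTF_8);

        // Parse the text as a JSON array of flat objects.
        JSONArray array = new JSONArray(json);

        // CDL.toString() converts the array to comma-delimited text (header row plus one row per object).
        String csv = CDL.toString(array);

        // FileUtils.writeStringToFile writes the CSV, creating the file if it does not exist.
        FileUtils.writeStringToFile(new File("output.csv"), csv, StandardCharsets.UTF_8);
    }
}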
Here is an example of how you can convert JSON array to CSV using Apache common-io in Java for your application:
Fig 1: Preview of the output that you will get on running this code from your IDE
Code
- Copy the code using the "Copy" button above, and paste it in a Java file in your IDE.
- Add the dependent library and import it in the Java file.
- Run the file to generate the CSV file.
I hope you found this useful. I have added links to the dependent libraries and version information in the following sections.
I found this code snippet by searching for "json array to csv in java" in kandi. You can try any such use case!
Development Libraries
You can add the dependent library to your Gradle or Maven files. You can get the dependency XML from the link above.
You can search for any dependent library on kandi, like Apache Commons IO and JSON-java.
Support
- For any support on kandi solution kits, please use the chat
- For further learning resources, visit the Open Weaver Community learning page.
You can use the pandas library in Python to append data to an existing table. Appending data to an existing table adds new rows without modifying or deleting the existing data. This is helpful if you want to update or add new data while maintaining a historical record of the old data.
- append(): The append() method in pandas adds rows to a DataFrame and returns a new DataFrame with the newly added rows. The original DataFrame remains unchanged. The new DataFrame can be assigned to a variable, which can then be used for further processing or analysis.
- pd.concat(): pd.concat() is a function in the pandas library that is used to concatenate or join multiple DataFrames along a particular axis (axis=0 for rows and axis=1 for columns).
pd.concat() is similar to the append() method, but it can be used to concatenate DataFrames along either the rows or columns axis, and it can also take a list of DataFrames as input, whereas append() can only take one DataFrame at a time and concatenate along the rows axis.
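As a quick, hedged illustration of the difference (the column names and values below are made up for the example):

import pandas as pd

# An existing "table" and the new rows to append; the contents are illustrative.
existing = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
new_rows = pd.DataFrame({"id": [3, 4], "value": ["c", "d"]})

# pd.concat() joins the frames along the row axis (axis=0) and returns a new
# DataFrame; ignore_index=True renumbers the combined index.
combined = pd.concat([existing, new_rows], axis=0, ignore_index=True)
print(combined)

# DataFrame.append() gives the same result for this case, but it was deprecated
# in pandas 1.4 and removed in pandas 2.0, so pd.concat() is the safer choice:
# combined = existing.append(new_rows, ignore_index=True)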
For a better understanding of appending data to an existing table using pandas, have a look at the code below.
Fig : Preview of the output that you will get on running this code from your IDE.
Code
In this solution we're using the pandas library.
Instructions
Follow the steps carefully to get the output easily.
- Install pandas in your IDE (any IDE of your choice).
- Copy the snippet using the 'Copy' button and paste it into your IDE.
- Add the required dependencies and import them in the Python file.
- Run the file to generate the output.
I hope you found this useful. I have added links to the dependent libraries and version information in the following sections.
I found this code snippet by searching for 'how to append data in existing table using pandas' in kandi. You can try any such use case!
Environment Tested
I tested this solution in the following versions. Be mindful of changes when working with other versions.
- The solution is created in PyCharm 2021.3.
- The solution is tested on Python 3.9.7.
- pandas version v1.5.2.
Using this solution, we are able to append data to an existing table using pandas in a few simple steps. It also provides an easy, hassle-free way to put together a hands-on, working version of code for appending data to an existing table with pandas.
Dependent Library
You can also search for any dependent libraries on kandi like 'pandas'.
Support
- For any support on kandi solution kits, please use the chat
- For further learning resources, visit the Open Weaver Community learning page.
Trending Discussions on CSV Processing
Performance issues reading CSV files in a Java (Spring Boot) application
Inserting json column in Bigquery
Avoid repeated checks in loop
golang syscall, locked to thread
How to break up a string into a vector fast?
CSV Regex skipping first comma
QUESTION
Performance issues reading CSV files in a Java (Spring Boot) application
Asked 2022-Jan-29 at 12:37
I am currently working on a Spring-based API which has to transform CSV data and expose it as JSON. It has to read big CSV files which will contain more than 500 columns and 2.5 million lines each. I am not guaranteed to have the same header between files (each file can have a completely different header from another), so I have no way to create a dedicated class which would provide a mapping to the CSV headers. Currently the API controller is calling a CSV service which reads the CSV data using a BufferedReader.
The code works fine on my local machine, but it is very slow: it takes about 20 seconds to process 450 columns and 40,000 lines. To improve processing speed, I tried to implement multithreading with Callables, but I am not familiar with that kind of concept, so the implementation might be wrong.
On top of that, the API is running out of heap memory when running on the server. I know that a solution would be to increase the amount of available memory, but I suspect that the replace() and split() operations on strings made in the Callables are responsible for consuming a large amount of heap memory.
So I actually have several questions :
#1. How could I improve the speed of the CSV reading?
#2. Is the multithreaded implementation with Callable correct?
#3. How could I reduce the amount of heap memory used in the process?
#4. Do you know of a different approach to split at commas and replace the double quotes in each CSV line? Would StringBuilder be of any help here? What about StringTokenizer?
Here below is the CSV method:
public static final int NUMBER_OF_THREADS = 10;

public static List<List<String>> readCsv(InputStream inputStream) {
    List<List<String>> rowList = new ArrayList<>();
    ExecutorService pool = Executors.newFixedThreadPool(NUMBER_OF_THREADS);
    List<Future<List<String>>> listOfFutures = new ArrayList<>();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8));
        String line = null;
        while ((line = reader.readLine()) != null) {
            CallableLineReader callableLineReader = new CallableLineReader(line);
            Future<List<String>> futureCounterResult = pool.submit(callableLineReader);
            listOfFutures.add(futureCounterResult);
        }
        reader.close();
        pool.shutdown();
    } catch (Exception e) {
        //log Error reading csv file
    }

    for (Future<List<String>> future : listOfFutures) {
        try {
            List<String> row = future.get();
        }
        catch (ExecutionException | InterruptedException e) {
            //log Error CSV processing interrupted during execution
        }
    }

    return rowList;
}
And the Callable implementation
public class CallableLineReader implements Callable<List<String>> {

    private final String line;

    public CallableLineReader(String line) {
        this.line = line;
    }

    @Override
    public List<String> call() throws Exception {
        return Arrays.asList(line.replace("\"", "").split(","));
    }
}
ANSWER
Answered 2022-Jan-29 at 02:56
I don't think that splitting this work onto multiple threads is going to provide much improvement, and may in fact make the problem worse by consuming even more memory. The main problem is using too much heap memory, and the performance problem is likely to be due to excessive garbage collection when the remaining available heap is very small (but it's best to measure and profile to determine the exact cause of performance problems).
The memory consumption would be less from the replace and split operations, and more from the fact that the entire contents of the file need to be read into memory in this approach. Each line may not consume much memory, but multiplied by millions of lines, it all adds up.
If you have enough memory available on the machine to assign a heap size large enough to hold the entire contents, that will be the simplest solution, as it won't require changing the code.
Otherwise, the best way to deal with large amounts of data in a bounded amount of memory is to use a streaming approach. This means that each line of the file is processed and then passed directly to the output, without collecting all of the lines in memory in between. This will require changing the method signature to use a return type other than List. Assuming you are using Java 8 or later, the Stream API can be very helpful. You could rewrite the method like this:
public static Stream<List<String>> readCsv(InputStream inputStream) {
    BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8));
    return reader.lines().map(line -> Arrays.asList(line.replace("\"", "").split(",")));
}
Note that this throws unchecked exceptions in case of an I/O error.
This will read and transform each line of input as needed by the caller of the method, and will allow previous lines to be garbage collected if they are no longer referenced. This then requires that the caller of this method also consume the data line by line, which can be tricky when generating JSON. The Jakarta EE JsonGenerator API offers one possible approach. If you need help with this part of it, please open a new question including details of how you're currently generating JSON.
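For illustration only, a minimal sketch of pairing the streaming readCsv with a Jakarta JSON-P JsonGenerator could look like the following; it assumes the jakarta.json API is available, and the surrounding class and method names are made up:

import jakarta.json.Json;
import jakarta.json.stream.JsonGenerator;

import java.io.Writer;
import java.util.List;
import java.util.stream.Stream;

public class CsvToJsonStreamer {
    // Writes each CSV row as a JSON array of strings, one row at a time,
    // without collecting all rows in memory first.
    public static void writeRows(Stream<List<String>> rows, Writer out) {
        try (JsonGenerator gen = Json.createGenerator(out)) {
            gen.writeStartArray();          // top-level "["
            rows.forEach(row -> {
                gen.writeStartArray();      // one nested array per CSV line
                row.forEach(gen::write);    // write(String) emits a JSON string value
                gen.writeEnd();
            });
            gen.writeEnd();                 // top-level "]"
        }
    }
}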
QUESTION
Inserting json column in Bigquery
Asked 2021-Jun-02 at 06:55
I have a JSON that I want to insert into BQ. The column data type is STRING. Here is the sample JSON value.
1"{\"a\":\"#\",\"b\":\"b value\"}"
2
This is a bulk load from a CSV file.
The error I'm getting is
1"{\"a\":\"#\",\"b\":\"b value\"}"
2Error: Data between close double quote (\") and field separator."; Reason: "invalid"},Invalid {Location: ""; Message: "Error while reading data, error message: CSV processing encountered too many errors, giving up. Rows: 0; errors: 1; max bad: 0; error percent: 0"; Reason: "invalid"}
3
Thanks!
ANSWER
Answered 2021-Jun-02 at 06:55
I think there is an issue with how you escape the double quotes.
I could reproduce the issue you describe, and fixed it by escaping the double quotes with " instead of a backslash \:
1"{\"a\":\"#\",\"b\":\"b value\"}"
2Error: Data between close double quote (\") and field separator."; Reason: "invalid"},Invalid {Location: ""; Message: "Error while reading data, error message: CSV processing encountered too many errors, giving up. Rows: 0; errors: 1; max bad: 0; error percent: 0"; Reason: "invalid"}
3"{""a"":""#"",""b"":""b value""}"
4
This information is well-hidden in the doc there (in the "Quote" section):
For example, if you want to escape the default character ' " ', use ' "" '.
QUESTION
Avoid repeated checks in loop
Asked 2021-Apr-23 at 11:51
I'm sorry if this has been asked before. It probably has, but I just have not been able to find it. On with the question:
I often have loops which are initialized with certain conditions that affect or (de)activate certain behaviors inside them, but do not drastically change the general loop logic. These conditions do not change through the loop's operation, but have to be checked every iteration anyway. Is there a way to optimize said loop in a pythonic way to avoid doing the same check every single time? I understand this would be a compiler's job in any compiled language, but there ain't no compiler here.
Now, for a specific example, imagine I have a function that parses a CSV file with a format somewhat like this, where you do not know in advance the columns that will be present on it:
COL_A,COL_B,COL_F,COL_H,COL_M,COL_T,COL_Z
1,2,3,4,5,6,7
8,9,10,11,12,13,14
...
And you have this function to manually process the file (I know there are better ways to deal with CSVs, the question is about optimizing the loop, not about CSV processing). For one reason or another, columns COL_C, COL_D, COL_M and COL_N (to name a few) need special processing. This would result in something like:
def process_csv(file):
    with open(file, 'r') as f:
        headers = f.readline().split(',')
        has_C = "COL_C" in headers
        has_D = "COL_D" in headers
        has_M = "COL_M" in headers
        has_N = "COL_N" in headers

        for line in f:
            elements = line.split(',')
            if has_C:
                ... # Special processing related to COL_C
            if has_D:
                ... # Special processing related to COL_D
            if has_M:
                ... # Special processing related to COL_M
            if has_N:
                ... # Special processing related to COL_N
            ... # General processing, common to all iterations
As I said above, any way to factor out the checks in some way? It may not represent a noticeable impact for this example, but if you have 50 special conditions inside the loop, you end up doing 50 'unnecessary' checks for every single iteration.
--------------- EDIT ------------------
As another example of what I would be looking for, this is (very) evil code that optimizes the loop by not doing any checks in it, but instead constructing the loop itself according to the starting conditions. I suppose for a (VERY) long loop with MANY conditions, this solution may eventually be faster. This depends on how exec is handled, though, which I am not sure about, since I find exec something to avoid...
def new_process_csv(file):
    with open(file, 'r') as f:
        headers = f.readline().split(',')
        code = \
f'''
for line in f:
    elements = line.split(',')
    {... if "COL_C" in headers else ''}
    {... if "COL_D" in headers else ''}
    {... if "COL_M" in headers else ''}
    {... if "COL_N" in headers else ''}
    ... # General processing
'''
        exec(code)
ANSWER
Answered 2021-Apr-23 at 11:36
Your code seems right to me, performance-wise.
You are doing your checks at the beginning of the loop:
        has_C = "COL_C" in headers
        has_D = "COL_D" in headers
        has_M = "COL_M" in headers
        has_N = "COL_N" in headers
Inside the loop, you are not doing unnecessary checks, you are just checking the result of something you already have computed, which is super-fast and really does not need optimization. You can run a profiler on your code to convince yourself of that: https://docs.python.org/3/library/profile.html
If you are looking for readability, you may want to:
- put the special cases in other methods (see the sketch after the snippet below).
- store the headers in a set to check membership:
        headers = f.readline().split(',')
        headers_set = set(headers)

        for line in f:
            elements = line.split(',')
            if "COL_C" in headers_set:
                ... # Special processing related to COL_C
            if "COL_D" in headers_set:
                ... # Special processing related to COL_D
            if "COL_M" in headers_set:
                ... # Special processing related to COL_M
            if "COL_N" in headers_set:
                ... # Special processing related to COL_N
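One way to combine both suggestions, sketched here as an illustration rather than as part of the original answer (handle_col_c and handle_col_d are hypothetical helpers), is to build the list of applicable handlers once from the header and then simply call them inside the loop:

def handle_col_c(elements):
    ... # Special processing related to COL_C

def handle_col_d(elements):
    ... # Special processing related to COL_D

def process_csv(file):
    with open(file, 'r') as f:
        headers = f.readline().split(',')

        # Decide once, before the loop, which special-case handlers apply.
        handlers = []
        if "COL_C" in headers:
            handlers.append(handle_col_c)
        if "COL_D" in headers:
            handlers.append(handle_col_d)

        for line in f:
            elements = line.split(',')
            for handle in handlers:
                handle(elements)
            ... # General processing, common to all iterations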
QUESTION
golang syscall, locked to thread
Asked 2021-Apr-21 at 15:29
I am attempting to create a program to scrape XML files. I'm experimenting with Go because of its goroutines. I have several thousand files, so some type of multiprocessing is almost a necessity...
I got a program to successfully run, and convert XML to CSV (as a test, not quite the end result), on a test set of files, but when run with the full set of files, it gives this:
runtime: program exceeds 10000-thread limit
I've been looking for similar problems, and there are a couple, but I haven't found one that was similar enough to solve this.
And finally, here's some code I'm running:
// main func (start threads)

for i := range filelist {
	channels = append(channels, make(chan Test))
	go Parse(files[i], channels[len(channels)-1])
}

// Parse func (individual threads)

func Parse(fileName string, c chan Test) {
	defer close(c)

	doc := etree.NewDocument()
	if err := doc.ReadFromFile(fileName); err != nil {
		return
	}

	root := doc.SelectElement("trc:TestResultsCollection")

	for _, test := range root.FindElements("//trc:TestResults/tr:ResultSet/tr:TestGroup/tr:Test") {
		var outcome Test
		outcome.StepType = test.FindElement("./tr:Extension/ts:TSStepProperties/ts:StepType").Text()
		outcome.Result = test.FindElement("./tr:Outcome").Attr[0].Value
		for _, attr := range test.Attr {
			if attr.Key == "name" {
				outcome.Name = attr.Value
			}
		}

		for _, attr := range test.FindElement("./tr:TestResult/tr:TestData/c:Datum").Attr {
			if attr.Key == "value" {
				outcome.Value = attr.Value
			}
		}

		c <- outcome
	}
}

// main (process results when threads return)

for c := 0; c < len(channels); c++ {
	for i := range channels[c] {
		// csv processing with i
	}
}
I'm sure there's some ugly code in there. I've just picked up Go recently, coming from other languages, so I apologize in advance. Anyhow, any ideas?
ANSWER
Answered 2021-Apr-21 at 15:25
I apologize for not including the correct error. As the comments pointed out, I was doing something dumb and creating a goroutine for every file. Thanks to JimB for correcting me, and torek for providing a solution and this link: https://gobyexample.com/worker-pools
jobs := make(chan string, numJobs)
results := make(chan []Test, numJobs)

for w := 0; w < numWorkers; w++ {
	go Worker(w, jobs, results)
	wg.Add(1)
}

// give workers jobs

for _, i := range files {
	if filepath.Ext(i) == ".xml" {
		jobs <- ("Path to files" + i)
	}
}

close(jobs)
wg.Wait()

//result processing <- results
QUESTION
How to break up a string into a vector fast?
Asked 2020-Jul-31 at 21:54
I am processing CSV and using the following code to process a single line.
std::vector<std::string> string_to_vector(const std::string& s, const char delimiter, const char escape) {
  std::stringstream sstr{s};
  std::vector<std::string> result;
  while (sstr.good()) {
    std::string substr;
    getline(sstr, substr, delimiter);
    while (substr.back() == escape) {
      std::string tmp;
      getline(sstr, tmp, delimiter);
      substr += "," + tmp;
    }
    result.emplace_back(substr);
  }
  return result;
}
What it does: the function breaks up string s based on delimiter. If the delimiter is escaped with escape, the delimiter will be ignored.
This code works but is super slow. How can I speed it up?
Do you know of any existing CSV processing implementation that does exactly this and which I could use?
ANSWER
Answered 2020-Jul-31 at 21:54
The fastest way to do something is to not do it at all.
If you can ensure that your source string s will outlive the use of the returned vector, you could replace your std::vector<std::string> with std::vector<char*>, which would point to the beginning of each substring. You then replace your identified delimiters with zeroes.
[EDIT] I have not moved up to C++17, so no string_view for me :)
NOTE: typical CSV is different from what you imply; it doesn't use escape for the comma, but surrounds entries with comma in it with double quotes. But I assume you know your data.
Implementation:
#include <iostream>
#include <vector>
#include <string>

std::vector<char*> string_to_vector(std::string& s,
                                    const char delimiter, const char escape)
{
  size_t prev(0), pos(0), from(0);
  std::vector<char*> v;
  while ((pos = s.find(delimiter, from)) != s.npos)
  {
    if (pos == 0 || s[pos - 1] != escape)
    {
      s[pos] = 0;
      v.push_back(&s[prev]);
      prev = pos + 1;
    }
    from = pos + 1;
  }
  v.push_back(&s[prev]);
  return v;
}

int main() {
  std::string test("this,is,a\\,test");
  std::vector<char*> v = string_to_vector(test, ',', '\\');

  for (auto& s : v)
    std::cout << s << " ";
}
QUESTION
CSV Regex skipping first comma
Asked 2020-May-11 at 22:02
I am using regex for CSV processing where data can be in quotes or without quotes. But if there is just a comma at the starting column, it skips it.
Here is the regex I am using:
(?:,"|^")(""|[\w\W]*?)(?=",|"$)|(?:,(?!")|^(?!"))([^,]*?|)(?=$|,)
Now the example data I am using is:
,"data",moredata,"Data"
Which should have 4 matches ["","data","moredata","Data"], but it always skips the first comma. It is fine if there are quotes on the first column, or it is not blank, but if it is empty with no quotes, it ignores it.
Here is a sample code I am using for testing purposes, it is written in Dart:
void main() {

  String delimiter = ",";
  String rawRow = ',,"data",moredata,"Data"';
  RegExp exp = new RegExp(r'(?:'+ delimiter + r'"|^")(^,|""|[\w\W]*?)(?="'+ delimiter + r'|"$)|(?:'+ delimiter + '(?!")|^(?!"))([^'+ delimiter + r']*?)(?=$|'+ delimiter + r')');

  Iterable<Match> matches = exp.allMatches(rawRow.replaceAll("\n","").replaceAll("\r","").trim());
  List<String> row = new List();
  matches.forEach((Match m) {
    //This checks to see which match group it found the item in.
    String cellValue;
    if (m.group(2) != null) {
      //Data found without speech marks
      cellValue = m.group(2);
    } else if (m.group(1) != null) {
      //Data found with speech marks (so it removes escaped quotes)
      cellValue = m.group(1).replaceAll('""', '"');
    } else {
      //Anything left
      cellValue = m.group(0).replaceAll('""', '"');
    }
    row.add(cellValue);
  });
  print(row.toString());

}
ANSWER
Answered 2020-May-11 at 22:02
Investigating your expression
(,"|^")
(""|[\w\W]*?)
(?=",|"$)
|
(,(?!")|^(?!"))
([^,]*?|)
(?=$|,)
(,"|^")(""|[\w\W]*?)(?=",|"$)
This part is to match quoted strings, that seem to work for you
Going through this part (,(?!")|^(?!"))([^,]*?|)(?=$|,)
(,(?!")|^(?!"))
start with comma not followed by " OR start of line not followed by "
([^,]*?|)
Start of line or comma zero or more non greedy and |, why |
(?=$|,)
end of line or , .
In CSV, this line ,,,3,4,5 should give 6 matches but the above only gets 5.
You could add (^(?=,)) at the beginning of the second part, the part that matches non-quoted sections.
Second group with match of start and also added non capture to groups
(?:^(?=,))|(?:,(?!")|^(?!"))(?:[^,]*?)(?=$|,)
Complete: (?:,"|^")(?:""|[\w\W]*?)(?=",|"$)|(?:^(?=,))|(?:,(?!")|^(?!"))(?:[^,]*?)(?=$|,)
Here is another that might work
(?:(?:"(?:[^"]|"")*"|(?<=,)[^,]*(?=,))|^[^,]+|^(?=,)|[^,]+$|(?<=,)$)
How that works i described here: Build CSV parser using regex
Community Discussions contain sources that include Stack Exchange Network
Tutorials and Learning Resources in CSV Processing
Tutorials and Learning Resources are not available at this moment for CSV Processing