Popular New Releases in CSV Processing
Laravel-Excel
v3.1.33
PapaParse
5.3.0
q
Next Release Development Build
miller
Restore --tsvlite; add gssub and expand dhms functions
visidata
v2.8: Python 3.10 compatibility
Popular Libraries in CSV Processing
by Maatwebsite php
10159 MIT
Supercharged Excel exports and imports in Laravel
by mholt javascript
9757 MIT
Fast and powerful CSV (delimited text) parser that gracefully handles large files and malformed input
by harelba python
8893 GPL-3.0
q - Run SQL directly on delimited files and multi-file sqlite databases
by BurntSushi rust
7540 NOASSERTION
A fast CSV command line toolkit written in Rust.
by mledoze php
5413 ODbL-1.0
World countries in JSON, CSV, XML and Yaml. Any help is welcome!
by johnkerl go
5203 NOASSERTION
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
by saulpw python
5081 GPL-3.0
A terminal spreadsheet multitool for discovering and arranging data
by wireservice python
4887 MIT
A suite of utilities for converting to and working with CSV, the king of tabular file formats.
by jazzband python
4102 MIT
Python Module for Tabular Datasets in XLS, CSV, JSON, YAML, &c.
Trending New libraries in CSV Processing
by alexhallam rust
1302 Unlicense
(tv) Tidy Viewer is a cross-platform CLI CSV pretty printer that uses column styling to maximize viewer enjoyment.
by shps951023 csharp
748 Apache-2.0
Fast, Low-Memory, Easy Excel .NET helper to import/export/template spreadsheet
by artperrin python
668 MIT
Convert tables stored as images to a usable .csv file
by githubocto typescript
305 MIT
by github typescript
253 MIT
A set of packages to make exporting artifacts from GitHub easier
by coolbutuseless r
237 MIT
Manipulate CSV files on the command line using dplyr
by Bunlong typescript
228 MIT
react-papaparse is the fastest in-browser CSV (or delimited text) parser for React. It is full of useful features such as CSVReader, CSVDownloader, readString, jsonToCSV, readRemoteFile, ... etc.
by tresorone javascript
192 AGPL-3.0
Extract transactions from PDF statements of brokers/banks or "Portfolio Performance" CSV exports. Compatible with Tresor One activities
by p-ranav c++
187 MIT
Fast CSV parser and writer for Modern C++
Top Authors in CSV Processing
Trending Kits in CSV Processing
A JSON array is an ordered list of values. It can store multiple values such as strings, numbers, booleans, or objects, and the values must be separated by commas. A CSV (comma-separated values) file is a plain text file that stores data column by column, with the columns split by commas.
The procedure to convert a JSON array to CSV is:
- Read the data from the JSON file and store the result as a string.
- Construct a JSON object using the above string.
- Get the JSON Array from the JSON Object.
- Create a new CSV file using java.io.File.
- Write comma-delimited text built from the JSONArray of JSONObjects to the newly created CSV file.
JSON is a lightweight, language-independent data-interchange format. It can be parsed from text to produce array-like objects. JSON is safe for transferring data, works well across platforms, and tends to be preferred over CSV when an application or file needs to scale or when working with large volumes of data. Its most common use is in JavaScript-based applications such as websites and browser extensions.
Here is an example of how you can convert a JSON array to CSV in Java:
Fig 1: Preview of the output that you will get on running this code from your IDE
Code
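Below is a minimal sketch of the procedure above, assuming the org.json library (version 20210307 is listed under Environment Tested) with its JSONObject, JSONArray, and CDL helpers; the file names "input.json" and "output.csv" and the "records" key are placeholders, not part of the original kit.

import java.io.File;
import java.io.FileWriter;
import java.nio.file.Files;

import org.json.CDL;
import org.json.JSONArray;
import org.json.JSONObject;

public class JsonArrayToCsv {
    public static void main(String[] args) throws Exception {
        // 1. Read the JSON file into a string ("input.json" is a placeholder path)
        String json = new String(Files.readAllBytes(new File("input.json").toPath()));

        // 2. Construct a JSONObject from the string
        JSONObject root = new JSONObject(json);

        // 3. Get the JSONArray from the JSONObject ("records" is an assumed key name)
        JSONArray records = root.getJSONArray("records");

        // 4. Create the new CSV file using java.io.File
        File csvFile = new File("output.csv");

        // 5. CDL.toString() builds comma-delimited text (with a header row) from a
        //    JSONArray of JSONObjects; write it to the new file
        try (FileWriter writer = new FileWriter(csvFile)) {
            writer.write(CDL.toString(records));
        }
    }
}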
Instructions
- Copy the code using the "Copy" button above, and paste it in a Java file in your IDE.
- Add the required dependencies and import them in the Java file.
- Run the file to generate the output CSV file.
I hope you found this useful. I have added links to the dependent libraries and version information in the following sections.
I found this code snippet by searching for 'json array list to csv format' in kandi. You can try any such use case!
Environment Tested
I tested this solution in the following versions. Be mindful of changes when working with other versions.
- The solution is created in Java 11.0.17.
- The solution is tested on JSON Version:20210307 and apache.commons:commons-io:1.3.2.
Using this solution, we are able to convert a JSON array to CSV in simple steps. It also provides an easy, hassle-free way to build a hands-on working version of the code that converts a JSON array to CSV.
Dependent Libraries
You can add the dependent libraries in your Gradle or Maven files; the dependency XML is available at the link above.
You can search for any dependent library on kandi, such as Apache Commons IO and JSON-java.
Support
- For any support on kandi solution kits, please use the chat
- For further learning resources, visit the Open Weaver Community learning page.
Converting a JSON array to a CSV file using Apache Commons IO in Java is helpful in several situations where you want to export data stored in a JSON array to a CSV file, for instance:
- Converting data saved in a database or other storage system that is represented as a JSON array.
- Exporting data from a web application or API that returns data in JSON format.
- Transforming data from one format to another as part of an ETL (extract, transform, load) process.
Apache Commons IO offers a large selection of classes and methods for I/O-related operations, including reading and writing files, navigating directories and files, and reading from and writing to input and output streams. It is a widely used library and a valuable tool to have in your toolbox when working with I/O operations in Java.
Here is an example of how you can convert a JSON array to CSV using Apache Commons IO in Java:
Fig 1: Preview of the output that you will get on running this code from your IDE
Code
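Below is a minimal sketch that uses Apache Commons IO for the file reads and writes, together with org.json for the conversion; the file names "data.json" and "data.csv" are placeholders, and the input is assumed to hold a top-level JSON array of objects.

import java.io.File;

import org.apache.commons.io.FileUtils;
import org.json.CDL;
import org.json.JSONArray;

public class JsonArrayToCsvWithCommonsIo {
    public static void main(String[] args) throws Exception {
        // Read the whole JSON file into a string with Commons IO ("data.json" is a placeholder)
        String json = FileUtils.readFileToString(new File("data.json"), "UTF-8");

        // The file is assumed to contain a top-level JSON array of objects
        JSONArray array = new JSONArray(json);

        // Convert the array of objects into comma-delimited text (header row included)
        String csv = CDL.toString(array);

        // Write the CSV out with Commons IO ("data.csv" is a placeholder)
        FileUtils.writeStringToFile(new File("data.csv"), csv, "UTF-8");
    }
}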
- Copy the code using the "Copy" button above, and paste it in a Java file in your IDE.
- Add the dependent libraries and import them in the Java file.
- Run the file to generate the CSV file.
I hope you found this useful. I have added links to the dependent libraries and version information in the following sections.
I found this code snippet by searching for "json array to csv in java" in kandi. You can try any such use case!
Development Libraries
You can add the dependent libraries in your Gradle or Maven files; the dependency XML is available at the link above.
You can search for any dependent library on kandi, such as Apache Commons IO and JSON-java.
Support
- For any support on kandi solution kits, please use the chat
- For further learning resources, visit the Open Weaver Community learning page.
A CSV parser library is a software tool that reads and processes Comma Separated Values (CSV) files. It helps developers parse CSV files and extract data for processing or analysis, providing a convenient way to work with tabular data.
CSV files range from simple text files containing rows and columns of data to complex data tables with various delimiters and formats. A CSV parser library can handle variations such as headers and custom column separators, as well as special characters and encoding formats.
CSV parser libraries offer various features for handling CSV data. They provide functions to parse CSV text into structured data, to convert CSV and validate it against standards, and to generate CSV output from data sources. Some libraries also support streaming processing for large datasets or remote files.
A CSV parser library can be used in a wide range of applications: in command-line tools for data manipulation, for import/export tasks, database interactions, or report generation in CSV format, and in web applications for handling CSV uploads, parsing user-generated data, or integrating with data visualization frameworks.
When using a CSV parser library, choosing the right library is vital. Consider factors like performance, memory usage, data validation support, streaming capabilities, and community support. Familiarize yourself with the library's API and documentation, understand how to configure delimiters and headers, and explore the available data extraction and transformation functions.
When working with a CSV parser library, make sure you handle data types correctly. CSV files may contain numeric values, dates, or strings, so use appropriate data types during parsing for accurate processing. Additionally, pay attention to potential issues like missing columns, header specifications, or encoding formats to avoid errors.
A CSV parser library provides a powerful toolset that helps developers parse, validate, transform, and extract data. By following best practices, developers can streamline data processing tasks, work with large datasets, and ensure data integrity and compatibility.
In conclusion, a CSV parser library offers a valuable solution for handling CSV data. With features like data parsing, formatting, and validation support, it gives developers the flexibility and efficiency needed to process tabular data. By leveraging these libraries, developers can enhance their data processing workflows and improve data accuracy.
PapaParse:
- This library supports parsing CSV files in Node.js with a simple and easy-to-use API.
- It helps handle CSV data with different delimiters, line endings, and custom formats.
- It supports asynchronous parsing, streaming, and error handling.
- It is suitable for handling large datasets or real-time data processing.
neat-csv:
- This library parses CSV data and returns an array of JavaScript objects.
- It helps in handling CSV files with complex structures and different data types.
- It supports options for custom delimiter, headers, and data transformation.
- This makes it useful for data extraction and manipulation tasks.
nodejs-csvtojson:
- This library supports converting CSV data into JSON format in Node.js.
- It helps transform CSV data into JSON format, facilitating integration and interoperability.
- It supports custom headers, delimiter, datatype conversion, and empty value handling.
- This enables flexible and accurate data conversion.
fast-csv:
- This library helps in parsing CSV files in Node.js.
- It supports reading from and writing to CSV files.
- This makes it suitable for data import/export, analysis, and transformation tasks.
csv-parser:
- This library parses CSV data and converts it into JavaScript objects or arrays.
- Streaming the data helps process large CSV files.
- It provides better memory management and performance.
- It supports custom transformations and handling of CSV headers.
- This makes it versatile for various data processing tasks.
node-csv-parse:
- This library offers a flexible and efficient CSV parsing solution in Node.js.
- It helps in parsing data with options for the delimiter, quote, and escape characters.
- It supports various input sources, including file streams, buffers, or strings.
- It allows easy integration with different data sources or APIs.
csv-writer:
- This library supports generating CSV files in Node.js by writing data to streams or files.
- It helps create CSV files with customizable headers, delimiters, and quote characters.
- It supports efficient batch writing and handling of large datasets.
- This makes it useful for exporting data or generating reports in CSV format.
csv-parse:
- This library parses CSV data in Node.js, focusing on simplicity and performance.
- It supports parsing CSV strings or streams.
- It provides options for the delimiter, quote character, and handling of special characters.
- It helps extract data from CSV files for further processing or analysis.
FAQ
1. What is a CSV parser, and how does it work?
CSV stands for Comma Separated Values. It is a plain text format used for representing tabular data. A CSV parser is a software component or library that allows developers to read and process CSV files. The parser analyzes the structure of the CSV file and splits it into rows and columns. It provides access to the data contained within.
2. Is there a CSV acid test suite for validating the performance of a parser library?
Yes, there is an industry-standard CSV acid test suite called "csv-spectrum." It validates the compliance and performance of CSV parser libraries. The csv-spectrum test suite contains test cases covering various aspects of CSV parsing, such as edge cases, corner cases, and compatibility with the CSV standard. Running a parser library against this test suite helps ensure its reliability and adherence to the CSV specification.
3. How can I create the desired CSV file from the data available in my database?
To create the desired CSV file from the data in your database, you can use a Node.js CSV parser library to read the data from the database and format it into the CSV file structure. You can extract the relevant data fields and format them with delimiters; by iterating through the records, you can generate a file that represents your contents.
4. What features of the Papa Parse wrapper make it an ideal choice for the nodejs CSV parser library?
PapaParse is a popular wrapper for CSV parsing in nodejs. It offers features making it an ideal choice for a CSV parser library, including:
- Stream-based parsing:
PapaParse supports reading CSV data from a stream, which makes it efficient for parsing large CSV files without loading the entire file into memory.
- Wide adoption and community support:
PapaParse is widely adopted and has an active community; it is well tested, maintained, and regularly updated with fixes.
- Cross-platform compatibility:
PapaParse is compatible with many platforms, making it suitable for both server- and client-side applications.
- Customizable parsing options:
It offers configuration options like delimiter selection, header treatment, and dynamic typing. It will allow developers to adapt the parsing process to their needs.
5. What is Comma Separated Values (CSV), and how is it different from JSON files?
Comma Separated Values (CSV) is a plain text format representing tabular data. It consists of rows, each containing values separated by commas. CSV files are human-readable and supported by many applications, making them a common choice for data exchange. JSON files, on the other hand, use the JavaScript Object Notation format, which represents data as key-value pairs or nested structures. JSON is versatile and supports complex data structures, while CSV is simpler and used for tabular data representation.
6. Does Byte Order Mark impact nodejs CSV parser library operations?
The Byte Order Mark (BOM) is a special marker that indicates the encoding of a text file. In the context of a Node.js CSV parser library, a BOM at the beginning of a CSV file might impact the parsing process: some libraries interpret the BOM as data, resulting in unexpected behavior. It is vital to handle a BOM either by removing it before parsing or by configuring the parser library to handle it.
7. How can I ensure the row data produced by my nodejs CSV parser library is accurate with source contents?
To ensure accurate row data produced by a nodejs CSV parser library, you can follow a few best practices:
- Validate the CSV structure:
Check that the CSV file adheres to the expected structure, including the correct number of columns and proper delimiters.
- Handle missing or unexpected data:
Implement error-handling routines to handle missing values, malformed rows, or inconsistent data types.
- Verify data integrity:
Compare the parsed row data with the original CSV or source data. It will ensure that the parsing process accurately represents the contents.
- Perform data cleaning and normalization:
Apply transformations or data cleaning steps to ensure consistency in the row data.
You can use the pandas library in Python to append data to an existing table. Appending adds new rows of data without modifying or deleting the existing data, which is helpful if you wish to update or add new data while maintaining a historical record of the old data.
- append(): The append() method in pandas adds rows to a DataFrame and returns a new DataFrame with the newly added rows. The original DataFrame remains unchanged, and the new DataFrame can be assigned to a variable for further processing or analysis.
- pd.concat(): pd.concat() is a function in the pandas library that is used to concatenate or join multiple DataFrames along a particular axis (axis=0 for rows and axis=1 for columns).
pd.concat() is similar to the append() method, but it can be used to concatenate DataFrames along either the rows or columns axis, and it can also take a list of DataFrames as input, whereas append() can only take one DataFrame at a time and concatenate along the rows axis.
For a better understanding of appending data to an existing table using pandas, have a look at the code below.
Fig : Preview of the output that you will get on running this code from your IDE.
Code
In this solution we're using the pandas library.
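A minimal sketch of the idea is shown below; the file name and column names are illustrative rather than taken from the original kit.

import pandas as pd

# Existing table loaded from a CSV file (the path is illustrative)
existing = pd.read_csv("existing_table.csv")

# New rows to append, with the same columns as the existing table
new_rows = pd.DataFrame({"name": ["Ada", "Linus"], "score": [95, 88]})

# append() would also work here (existing.append(new_rows)), but it is deprecated
# in newer pandas releases, so pd.concat() is used instead; both return a new
# DataFrame and leave the originals unchanged
combined = pd.concat([existing, new_rows], ignore_index=True)

# Write the combined table back out
combined.to_csv("existing_table.csv", index=False)
print(combined)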
Instructions
Follow the steps carefully to get the output easily.
- Install pandas in your IDE (any IDE of your choice).
- Copy the snippet using the 'Copy' button and paste it in your IDE.
- Add the required dependencies and import them in the Python file.
- Run the file to generate the output.
I hope you found this useful. I have added links to the dependent libraries and version information in the following sections.
I found this code snippet by searching for 'how to append data in existing table using pandas' in kandi. You can try any such use case!
Environment Tested
I tested this solution in the following versions. Be mindful of changes when working with other versions.
- The solution is created in PyCharm 2021.3.
- The solution is tested on Python 3.9.7.
- Pandas version-v1.5.2.
Using this solution, we are able to append data to an existing table using pandas in simple steps. It also provides an easy, hassle-free way to build a hands-on working version of the code.
Dependent Library
You can also search for any dependent libraries on kandi like 'pandas'.
Support
- For any support on kandi solution kits, please use the chat
- For further learning resources, visit the Open Weaver Community learning page.
Trending Discussions on CSV Processing
Performance issues reading CSV files in a Java (Spring Boot) application
Inserting json column in Bigquery
Avoid repeated checks in loop
golang syscall, locked to thread
How to break up a string into a vector fast?
CSV Regex skipping first comma
QUESTION
Performance issues reading CSV files in a Java (Spring Boot) application
Asked 2022-Jan-29 at 12:37 I am currently working on a Spring-based API which has to transform CSV data and expose it as JSON. It has to read big CSV files which will contain more than 500 columns and 2.5 million lines each. I am not guaranteed to have the same header between files (each file can have a completely different header than another), so I have no way to create a dedicated class which would provide mapping with the CSV headers. Currently the API controller is calling a CSV service which reads the CSV data using a BufferedReader.
The code works fine on my local machine but it is very slow: it takes about 20 seconds to process 450 columns and 40,000 lines. To improve processing speed, I tried to implement multithreading with Callable(s) but I am not familiar with that kind of concept, so the implementation might be wrong.
Other than that, the API is running out of heap memory when running on the server. I know that a solution would be to increase the amount of available memory, but I suspect that the replace() and split() operations on strings made in the Callable(s) are responsible for consuming a large amount of heap memory.
So I actually have several questions :
#1. How could I improve the speed of the CSV reading ?
#2. Is the multithread implementation with Callable correct ?
#3. How could I reduce the amount of heap memory used in the process ?
#4. Do you know of a different approach to split at commas and replace the double quotes in each CSV line? Would StringBuilder be of any help here? What about StringTokenizer?
Here below is the CSV method:
public static final int NUMBER_OF_THREADS = 10;

public static List<List<String>> readCsv(InputStream inputStream) {
    List<List<String>> rowList = new ArrayList<>();
    ExecutorService pool = Executors.newFixedThreadPool(NUMBER_OF_THREADS);
    List<Future<List<String>>> listOfFutures = new ArrayList<>();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8));
        String line = null;
        while ((line = reader.readLine()) != null) {
            CallableLineReader callableLineReader = new CallableLineReader(line);
            Future<List<String>> futureCounterResult = pool.submit(callableLineReader);
            listOfFutures.add(futureCounterResult);
        }
        reader.close();
        pool.shutdown();
    } catch (Exception e) {
        //log Error reading csv file
    }

    for (Future<List<String>> future : listOfFutures) {
        try {
            List<String> row = future.get();
        }
        catch ( ExecutionException | InterruptedException e) {
            //log Error CSV processing interrupted during execution
        }
    }

    return rowList;
}
And the Callable implementation
public class CallableLineReader implements Callable<List<String>> {

    private final String line;

    public CallableLineReader(String line) {
        this.line = line;
    }

    @Override
    public List<String> call() throws Exception {
        return Arrays.asList(line.replace("\"", "").split(","));
    }
}
ANSWER
Answered 2022-Jan-29 at 02:56 I don't think that splitting this work onto multiple threads is going to provide much improvement, and may in fact make the problem worse by consuming even more memory. The main problem is using too much heap memory, and the performance problem is likely to be due to excessive garbage collection when the remaining available heap is very small (but it's best to measure and profile to determine the exact cause of performance problems).
The memory consumption would be less from the replace and split operations, and more from the fact that the entire contents of the file need to be read into memory in this approach. Each line may not consume much memory, but multiplied by millions of lines, it all adds up.
If you have enough memory available on the machine to assign a heap size large enough to hold the entire contents, that will be the simplest solution, as it won't require changing the code.
Otherwise, the best way to deal with large amounts of data in a bounded amount of memory is to use a streaming approach. This means that each line of the file is processed and then passed directly to the output, without collecting all of the lines in memory in between. This will require changing the method signature to use a return type other than List. Assuming you are using Java 8 or later, the Stream API can be very helpful. You could rewrite the method like this:
public static Stream<List<String>> readCsv(InputStream inputStream) {
    BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8));
    return reader.lines().map(line -> Arrays.asList(line.replace("\"", "").split(",")));
}
Note that this throws unchecked exceptions in case of an I/O error.
This will read and transform each line of input as needed by the caller of the method, and will allow previous lines to be garbage collected if they are no longer referenced. This then requires that the caller of this method also consume the data line by line, which can be tricky when generating JSON. The Jakarta EE JsonGenerator API offers one possible approach. If you need help with this part of it, please open a new question including details of how you're currently generating JSON.
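As a rough sketch of how the streaming CSV reader and JsonGenerator could fit together (this is not from the original answer; it assumes the jakarta.json API and the Stream<List<String>> shape returned above, and the class name is illustrative):

import java.io.OutputStream;
import java.util.List;
import java.util.stream.Stream;

import jakarta.json.Json;
import jakarta.json.stream.JsonGenerator;

public class CsvToJsonStreamer {

    // Writes each CSV row as a JSON array of strings, inside one outer JSON array,
    // without collecting all rows in memory first.
    public static void writeAsJson(Stream<List<String>> rows, OutputStream out) {
        try (JsonGenerator gen = Json.createGenerator(out)) {
            gen.writeStartArray();
            rows.forEach(row -> {
                gen.writeStartArray();
                row.forEach(gen::write);
                gen.writeEnd();
            });
            gen.writeEnd();
        }
    }
}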
QUESTION
Inserting json column in Bigquery
Asked 2021-Jun-02 at 06:55 I have a JSON that I want to insert into BQ. The column data type is STRING. Here is the sample JSON value.
1"{\"a\":\"#\",\"b\":\"b value\"}"
2
This is a bulk load from a CSV file.
The error I'm getting is
1"{\"a\":\"#\",\"b\":\"b value\"}"
2Error: Data between close double quote (\") and field separator."; Reason: "invalid"},Invalid {Location: ""; Message: "Error while reading data, error message: CSV processing encountered too many errors, giving up. Rows: 0; errors: 1; max bad: 0; error percent: 0"; Reason: "invalid"}
3
Thanks!
ANSWER
Answered 2021-Jun-02 at 06:55 I think there is an issue with how you escape the double quotes.
I could reproduce the issue you describe, and fixed it by escaping the double quotes with " instead of a backslash \:
1"{\"a\":\"#\",\"b\":\"b value\"}"
2Error: Data between close double quote (\") and field separator."; Reason: "invalid"},Invalid {Location: ""; Message: "Error while reading data, error message: CSV processing encountered too many errors, giving up. Rows: 0; errors: 1; max bad: 0; error percent: 0"; Reason: "invalid"}
3"{""a"":""#"",""b"":""b value""}"
4
This information is well-hidden in the doc there (in the "Quote" section):
For example, if you want to escape the default character ' " ', use ' "" '.
QUESTION
Avoid repeated checks in loop
Asked 2021-Apr-23 at 11:51 I'm sorry if this has been asked before. It probably has, but I just have not been able to find it. On with the question:
I often have loops which are initialized with certain conditions that affect or (de)activate certain behaviors inside them, but do not drastically change the general loop logic. These conditions do not change through the loop's operation, but have to be checked every iteration anyway. Is there a way to optimize said loop in a pythonic way to avoid doing the same check every single time? I understand this would be a compiler's job in any compiled language, but there ain't no compiler here.
Now, for a specific example, imagine I have a function that parses a CSV file with a format somewhat like this, where you do not know in advance the columns that will be present on it:
COL_A,COL_B,COL_F,COL_H,COL_M,COL_T,COL_Z
1,2,3,4,5,6,7
8,9,10,11,12,13,14
...
And you have this function to manually process the file (I know there are better ways to deal with CSVs, the question is about optimizing the loop, not about CSV processing). For one reason or another, columns COL_C, COL_D, COL_M and COL_N (to name a few) need special processing. This would result in something like:
def process_csv(file):
    with open(file, 'r') as f:
        headers = f.readline().split(',')
        has_C = "COL_C" in headers
        has_D = "COL_D" in headers
        has_M = "COL_M" in headers
        has_N = "COL_N" in headers

        for line in f:
            elements = line.split(',')
            if has_C:
                ... # Special processing related to COL_C
            if has_D:
                ... # Special processing related to COL_D
            if has_M:
                ... # Special processing related to COL_M
            if has_N:
                ... # Special processing related to COL_N
            ... # General processing, common to all iterations
As I said above, any way to factor out the checks in some way? It may not represent a noticeable impact for this example, but if you have 50 special conditions inside the loop, you end up doing 50 'unnecessary' checks for every single iteration.
--------------- EDIT ------------------
As another example of what I would be looking for, this is (very) evil code that optimizes the loop by not doing any check in it, but instead constructing the very loop itself according to the starting conditions. I suppose for a (VERY) long loop with MANY conditions, this solution may eventually be faster. This depends on how exec is handled, though, which I am not sure since I find it something to avoid...
def new_process_csv(file):
    with open(file, 'r') as f:
        headers = f.readline().split(',')
        code = \
f'''
for line in f:
    elements = line.split(',')
    {... if "COL_C" in headers else ''}
    {... if "COL_D" in headers else ''}
    {... if "COL_M" in headers else ''}
    {... if "COL_N" in headers else ''}
    ... # General processing
'''
        exec(code)
ANSWER
Answered 2021-Apr-23 at 11:36 Your code seems right to me, performance-wise.
You are doing your checks at the beginning of the loop:
has_C = "COL_C" in headers
has_D = "COL_D" in headers
has_M = "COL_M" in headers
has_N = "COL_N" in headers
Inside the loop, you are not doing unnecessary checks, you are just checking the result of something you already have computed, which is super-fast and really does not need optimization. You can run a profiler on your code to convince yourself of that: https://docs.python.org/3/library/profile.html
If you are looking for readability, you may want to:
- put the special cases in other methods.
- store the headers in a set to check membership:
headers = f.readline().split(',')
headers_set = set(headers)

for line in f:
    elements = line.split(',')
    if "COL_C" in headers_set:
        ... # Special processing related to COL_C
    if "COL_D" in headers_set:
        ... # Special processing related to COL_D
    if "COL_M" in headers_set:
        ... # Special processing related to COL_M
    if "COL_N" in headers_set:
        ... # Special processing related to COL_N
QUESTION
golang syscall, locked to thread
Asked 2021-Apr-21 at 15:29 I am attempting to create a program to scrape XML files. I'm experimenting with Go because of its goroutines. I have several thousand files, so some type of multiprocessing is almost a necessity...
I got a program to successfully run and convert XML to CSV (as a test, not quite the end result) on a test set of files, but when run with the full set of files, it gives this:
runtime: program exceeds 10000-thread limit
I've been looking for similar problems, and there's a couple, but I haven't found one that was similar enough to solve this.
And finally, here's some code I'm running:
// main func (start threads)

for i := range filelist {
    channels = append(channels, make(chan Test))
    go Parse(files[i], channels[len(channels)-1])
}

// Parse func (individual threads)

func Parse(fileName string, c chan Test) {
    defer close(c)

    doc := etree.NewDocument()
    if err := doc.ReadFromFile(fileName); err != nil {
        return
    }

    root := doc.SelectElement("trc:TestResultsCollection")

    for _, test := range root.FindElements("//trc:TestResults/tr:ResultSet/tr:TestGroup/tr:Test") {
        var outcome Test
        outcome.StepType = test.FindElement("./tr:Extension/ts:TSStepProperties/ts:StepType").Text()
        outcome.Result = test.FindElement("./tr:Outcome").Attr[0].Value
        for _, attr := range test.Attr {
            if attr.Key == "name" {
                outcome.Name = attr.Value
            }
        }

        for _, attr := range test.FindElement("./tr:TestResult/tr:TestData/c:Datum").Attr {
            if attr.Key == "value" {
                outcome.Value = attr.Value
            }
        }

        c <- outcome
    }
}

// main (process results when threads return)

for c := 0; c < len(channels); c++ {
    for i := range channels[c] {
        // csv processing with i
    }
}
I'm sure there's some ugly code in there. I've just picked up Go recently from other languages... so I apologize in advance. Anyhow, any ideas?
ANSWER
Answered 2021-Apr-21 at 15:25 I apologize for not including the correct error. As the comments pointed out, I was doing something dumb and creating a goroutine for every file. Thanks to JimB for correcting me, and to torek for providing a solution and this link: https://gobyexample.com/worker-pools
jobs := make(chan string, numJobs)
results := make(chan []Test, numJobs)

for w := 0; w < numWorkers; w++ {
    go Worker(w, jobs, results)
    wg.Add(1)
}

// give workers jobs

for _, i := range files {
    if filepath.Ext(i) == ".xml" {
        jobs <- ("Path to files" + i)
    }
}

close(jobs)
wg.Wait()

//result processing <- results
QUESTION
How to break up a string into a vector fast?
Asked 2020-Jul-31 at 21:54 I am processing CSV and using the following code to process a single line.
std::vector<std::string> string_to_vector(const std::string& s, const char delimiter, const char escape) {
  std::stringstream sstr{s};
  std::vector<std::string> result;
  while (sstr.good()) {
    std::string substr;
    getline(sstr, substr, delimiter);
    while (substr.back() == escape) {
      std::string tmp;
      getline(sstr, tmp, delimiter);
      substr += "," + tmp;
    }
    result.emplace_back(substr);
  }
  return result;
}
What it does: the function breaks up string s based on delimiter. If the delimiter is escaped with escape, the delimiter will be ignored.
This code works but is super slow. How can I speed it up?
Do you know any existing csv processing implementation that does exactly this and which I could use?
ANSWER
Answered 2020-Jul-31 at 21:54 The fastest way to do something is to not do it at all.
If you can ensure that your source string s will outlive the use of the returned vector, you could replace your std::vector<std::string> with std::vector<char*>, which would point to the beginning of each substring. You then replace your identified delimiters with zeroes.
[EDIT] I have not moved up to C++17, so no string_view for me :)
NOTE: typical CSV is different from what you imply; it doesn't use escape for the comma, but surrounds entries with comma in it with double quotes. But I assume you know your data.
Implementation:
#include <iostream>
#include <vector>
#include <string>

std::vector<char*> string_to_vector(std::string& s,
                                    const char delimiter, const char escape)
{
  size_t prev(0), pos(0), from(0);
  std::vector<char*> v;
  while ((pos = s.find(delimiter, from)) != s.npos)
  {
    if (pos == 0 || s[pos - 1] != escape)
    {
      s[pos] = 0;
      v.push_back(&s[prev]);
      prev = pos + 1;
    }
    from = pos + 1;
  }
  v.push_back(&s[prev]);
  return v;
}

int main() {
  std::string test("this,is,a\\,test");
  std::vector<char*> v = string_to_vector(test, ',', '\\');

  for (auto& s : v)
    std::cout << s << " ";
}
QUESTION
CSV Regex skipping first comma
Asked 2020-May-11 at 22:02 I am using regex for CSV processing where data can be in quotes or no quotes. But if there is just a comma at the starting column, it skips it.
Here is the regex I am using:
(?:,"|^")(""|[\w\W]*?)(?=",|"$)|(?:,(?!")|^(?!"))([^,]*?|)(?=$|,)
Now the example data I am using is:
,"data",moredata,"Data"
Which should have 4 matches ["","data","moredata","Data"], but it always skips the first comma. It is fine if there is quotes on the first column, or it is not blank, but if it is empty with no quotes, it ignores it.
Here is a sample code I am using for testing purposes, it is written in Dart:
void main() {

  String delimiter = ",";
  String rawRow = ',,"data",moredata,"Data"';
  RegExp exp = new RegExp(r'(?:'+ delimiter + r'"|^")(^,|""|[\w\W]*?)(?="'+ delimiter + r'|"$)|(?:'+ delimiter + '(?!")|^(?!"))([^'+ delimiter + r']*?)(?=$|'+ delimiter + r')');

  Iterable<Match> matches = exp.allMatches(rawRow.replaceAll("\n","").replaceAll("\r","").trim());
  List<String> row = new List();
  matches.forEach((Match m) {
    //This checks to see which match group it found the item in.
    String cellValue;
    if (m.group(2) != null) {
      //Data found without speech marks
      cellValue = m.group(2);
    } else if (m.group(1) != null) {
      //Data found with speech marks (so it removes escaped quotes)
      cellValue = m.group(1).replaceAll('""', '"');
    } else {
      //Anything left
      cellValue = m.group(0).replaceAll('""', '"');
    }
    row.add(cellValue);
  });
  print(row.toString());
}
ANSWER
Answered 2020-May-11 at 22:02 Investigating your expression
29(,"|^")
30(""|[\w\W]*?)
31(?=",|"$)
32|
33(,(?!")|^(?!"))
34([^,]*?|)
35(?=$|,)
36
(,"|^")(""|[\w\W]*?)(?=",|"$)
This part is to match quoted strings, and that seems to work for you.
Going through this part (,(?!")|^(?!"))([^,]*?|)(?=$|,)
(,(?!")|^(?!"))
start with comma not followed by " OR start of line not followed by "
([^,]*?|)
Start of line or comma zero or more non greedy and |, why |
(?=$|,)
end of line or , .
In CSV, this line ,,,3,4,5 should give 6 matches but the above only gets 5.
You could add (^(?=,)) at the beginning of the second part, the part that matches non-quoted sections.
Second part with the start-of-line match added, and the groups made non-capturing:
(?:^(?=,))|(?:,(?!")|^(?!"))(?:[^,]*?)(?=$|,)
Complete: (?:,"|^")(?:""|[\w\W]*?)(?=",|"$)|(?:^(?=,))|(?:,(?!")|^(?!"))(?:[^,]*?)(?=$|,)
Here is another that might work
(?:(?:"(?:[^"]|"")*"|(?<=,)[^,]*(?=,))|^[^,]+|^(?=,)|[^,]+$|(?<=,)$)
How that works I described here: Build CSV parser using regex
Community Discussions contain sources that include Stack Exchange Network
Tutorials and Learning Resources in CSV Processing
Tutorials and Learning Resources are not available at this moment for CSV Processing