ijson | efficient alternative to serde_json::Value | JSON Processing library
kandi X-RAY | ijson Summary
This crate offers a replacement for serde-json's Value type, which is significantly more memory efficient. As a ballpark figure, it will typically use half as much memory as serde-json when deserializing a value and the memory footprint of cloning a value is more than 7x smaller.
ijson Key Features
ijson Examples and Code Snippets
Community Discussions
Trending Discussions on ijson
QUESTION
I have thousands of very large JSON files that I need to process on specific elements. To avoid memory overload I am using a Python library called ijson, which works fine when I am processing only a single element from the JSON file, but when I try to process multiple elements at once it throws:
IncompleteJSONError: parse error: premature EOF
Partial JSON:
...ANSWER
Answered 2021-Dec-04 at 12:58
I think this is happening because you've finished reading your IO stream from the file: you're already at the end, yet you're asking for another query.
What you can do is reset the cursor to position 0 before the second query:
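A minimal sketch of that idea; the file name and the two item prefixes below are placeholders, not from the original question:

import ijson

with open("data.json", "rb") as f:
    # First query consumes the stream up to EOF.
    first = list(ijson.items(f, "first_key.item"))

    # Rewind to the start of the file before the second query; without this,
    # ijson sees an empty stream and raises "parse error: premature EOF".
    f.seek(0)
    second = list(ijson.items(f, "second_key.item"))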
QUESTION
I am new to Python, so please excuse me if I am not asking the questions in a Pythonic way.
My requirements are as follows:
I need to write python code to implement this requirement.
I will be reading 60 JSON files as input. Each file is approximately 150 GB.
The sample structure for all 60 JSON files is shown below. Please note that each file will have only ONE JSON object, and the huge size of each file is due to the number and size of the "array_element" array contained in that one huge JSON object.
{ "string_1":"abc", "string_1":"abc", "string_1":"abc", "string_1":"abc", "string_1":"abc", "string_1":"abc", "array_element":[] }
The transformation logic is simple. I need to merge all the array_element entries from all 60 files and write them into one HUGE JSON file. That is, the output JSON file will be almost 150 GB x 60 in size.
Questions for which I am requesting your help on:
For reading: I am planning on using the "ijson" module's ijson.items(file_object, "array_element"). Could you please tell me whether ijson.items will "yield" (that is, NOT load the entire file into memory) one item at a time from the "array_element" array in the JSON file? I don't think json.load is an option here because we cannot hold such a huge dictionary in memory.
For writing: I am planning to read each item using ijson.items, "encode" it with json.dumps, and then write it to the file using file_object.write rather than json.dump, since I cannot have such a huge dictionary in memory. Could you please let me know whether the f.flush() applied in the code shown below is needed? To my understanding, the internal buffer will automatically be flushed when it is full, and its size is constant and won't dynamically grow to the point of overloading memory. Please let me know.
Is there a better approach than the ones mentioned above for incrementally reading and writing huge JSON files?
Code snippet showing above described reading and writing logic:
...ANSWER
Answered 2021-Oct-31 at 14:18
The following program assumes that the input files have a format that is predictable enough to skip JSON parsing for the sake of performance.
My assumptions, inferred from your description, are:
- All files have the same encoding.
- All files have a single position somewhere at the start where "array_element":[ can be found, after which the "interesting portion" of the file begins.
- All files have a single position somewhere at the end where ]} marks the end of the "interesting portion".
- All "interesting portions" can be joined with commas and still be valid JSON.
When all of these points are true, concatenating a predefined header fragment, the respective file ranges, and a footer fragment would produce one large, valid JSON file.
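A rough sketch of that concatenation approach, under the assumptions above; the file names, the chunk size and the assumption that "]}" is literally the last two bytes of each file are placeholders, so real files may need a more careful search for the end marker:

# Hypothetical input/output file names; adjust them to your real files.
INPUT_FILES = ["input_01.json", "input_02.json"]  # ... up to the 60 files
CHUNK = 1024 * 1024                 # copy 1 MiB at a time so memory stays flat
MARKER = b'"array_element":['       # start-of-array marker inside each input file
FOOTER = b"]}"                      # assumed to be the very last bytes of each file

with open("merged.json", "wb") as out:
    out.write(b'{"array_element":[')            # predefined header fragment
    for i, path in enumerate(INPUT_FILES):
        with open(path, "rb") as f:
            # Assumption: the marker appears within the first chunk of the file.
            start = f.read(CHUNK).index(MARKER) + len(MARKER)
            f.seek(0, 2)                        # jump to the end to measure the file
            end = f.tell() - len(FOOTER)        # interesting portion stops before "]}"
            if i > 0:
                out.write(b",")                 # join the portions with commas
            f.seek(start)
            remaining = end - start
            while remaining > 0:
                chunk = f.read(min(CHUNK, remaining))
                out.write(chunk)
                remaining -= len(chunk)
    out.write(FOOTER)                           # predefined footer fragment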
QUESTION
I'm using ijson to parse through large JSONs. I have this code, which should give me a dict of values corresponding to the relevant JSON fields:
ANSWER
Answered 2021-Oct-27 at 13:02
Beware of how you are collecting the results from kvitems. In all your examples above you are using generator expressions, which are themselves lazily evaluated, and this may lead to misunderstandings. You are not showing, however, how you find that your final dictionary has values for id but not for the other keys. I'm assuming it's only because you iterate over the values under parse_records['id'] first. As you do so, that generator expression is evaluated and the underlying kvitems generator is exhausted. When you then iterate over the values of the other generator expressions, the underlying kvitems generator that feeds them is already exhausted, so they yield nothing. However, if you were to iterate over the values for one of the other keys first, you would see values for that key and not for the others.
Generator expressions themselves are great, but in this case they might end up adding confusion. If you want to avoid this situation you may want to consolidate those sequences into lists instead (e.g., using [... for k, v in kvitems ...] instead of (... for k, v in kvitems ...)).
As you point out, kvitems is a single-pass generator (or a single-pass asynchronous generator when fed with an asynchronous file-like object), so once you fully iterate over it, further iterations yield no values. This is indeed why, in your original code, you get values for id but not for the other keys, which are collected on subsequent iterations over an already-iterated kvitems object.
Trying to duplicate the kvitems object is also bogus: as you also found out, you are simply creating a list with the same object in all positions instead of copies of the original object.
Trying to copy the kvitems object is simply not possible. The only option to get N "copies" is to actually construct N different objects; this means, however, that the input file will be read N times (and needs to be opened N times as well, as kvitems will advance the given file until it has no more input). Possible, but not great.
The result of itertools.cycle is an infinite generator, which you then use as the basis to construct different generator expressions (so, lazily evaluated). You mention that this solution worked in ways "you don't understand", but don't go into what exactly happened. My expectation is that when trying to inspect the values for any of the keys, you run into an infinite loop because your generator expression is iterating over an infinite generator, or something similar.
You say that your final piece of code works as expected. This is the only bit that surprises me, especially if you really, really inspected (i.e., evaluated) all three of the generator expressions after you created them. If you could clarify whether that's the case, it would be interesting; otherwise, if you created all three generator expressions but then only evaluated one or the other, there are no surprises here (because of the explanation about result collection above).
How to tackle your problem
It basically all boils down to doing a single iteration over kvitems. You could try, for instance, something like this:
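A minimal sketch of that single pass; the file name, the 'records.item' prefix and the key names are placeholders, not the original question's data:

import ijson

WANTED_KEYS = {"id", "name", "value"}
results = {key: [] for key in WANTED_KEYS}

with open("large.json", "rb") as f:
    # One pass over kvitems: every (key, value) pair is routed to its list,
    # so the generator is only ever iterated once.
    for key, value in ijson.kvitems(f, "records.item"):
        if key in WANTED_KEYS:
            results[key].append(value)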
QUESTION
I have come across an error while parsing json with ijson.
Background:
I have a series (approx. 1000) of large files of Twitter data that are compressed in the '.bz2' format. I need to get elements from the files into a pd.DataFrame for further analysis. I have identified the keys I need to get. I am cautious about putting Twitter data up.
Attempt:
I have managed to decompress the files using bz2.decompress with the following code:
ANSWER
Answered 2021-Oct-17 at 14:46
To directly answer your two questions:
- The decompression method is correct in the sense that it yields JSON data that you then feed to ijson. As you point out, ijson works with both str and bytes inputs (although the latter is preferred); if you were giving ijson some non-JSON input you wouldn't see an error showing JSON data in it.
- This is a very common error that is described in ijson's FAQ. It basically means your JSON document has more than one top-level value, which is not standard JSON, but is supported by ijson via the multiple_values option (see the docs for details).
About the code as a whole: while it works correctly, it could be improved on. The whole point of using ijson is that you can avoid loading the full JSON contents into memory, but the code you posted doesn't use this to its advantage: it first opens the bz2-compressed file, reads it as a whole, decompresses that as a whole, (unnecessarily) decodes that as a whole, and then gives the decoded data as input to ijson. If your input file is small, and the decompressed data is also small, you won't see any impact, but if your files are big then you'll definitely start noticing it.
A better approach is to stream the data through all the operations so that everything happens incrementally: decompression and JSON parsing, with no decoding needed at all. Something along the lines of:
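A minimal sketch of that streaming pipeline; the file name and the process() handler are placeholders:

import bz2
import ijson

def process(tweet):
    # Placeholder for whatever per-tweet handling you need.
    print(tweet.get("id"))

# bz2.open decompresses incrementally and ijson reads the raw bytes directly,
# so nothing is ever held in memory as a whole. multiple_values=True handles
# the multiple top-level JSON values described in the answer.
with bz2.open("tweets.json.bz2", "rb") as f:
    for tweet in ijson.items(f, "", multiple_values=True):
        process(tweet)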
QUESTION
I work with big GeoJSON data (more than 1 GB) with this structure. Here is part of it.
...ANSWER
Answered 2021-Sep-15 at 09:56
This answer works if you are sure that the data is GeoJSON and it is structured properly.
For reading GeoJSON data you can use the Geopandas library:
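For example, a minimal sketch assuming a hypothetical file name:

import geopandas as gpd

# Geopandas parses the GeoJSON into a GeoDataFrame with a geometry column.
gdf = gpd.read_file("data.geojson")
print(gdf.head())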
QUESTION
I'm trying to parse a huge 12 GB JSON file with almost 5 million lines (each one is an object) in Python and store it in a database. I'm using ijson and multiprocessing in order to run it faster. Here is the code:
...ANSWER
Answered 2021-May-25 at 18:09
I've had to make quite a few extrapolations and assumptions, but it looks like:
- you're using Django
- you want to populate an SQL database with venue, paper and author data
- you want to then do some analysis using Pandas
Populating your SQL database can be done pretty neatly with something like the following.
- I added the tqdm package so you get a progress indication.
- This assumes there's a PaperAuthor model that links papers and authors.
- Unlike the original code, this will not save duplicate Venues in the database.
- You can see I replaced get_or_create and create with stubs to make this runnable without the database models (or indeed, without Django), just having the dataset you're using available.
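The sketch below is a hedged reconstruction of that approach, not the answer's original code: the file name, the item prefix, the field names and the stubbed get_or_create/create helpers are all placeholders.

import ijson
from tqdm import tqdm

def get_or_create(model, **fields):
    # Stub standing in for Django's Model.objects.get_or_create().
    return fields, True

def create(model, **fields):
    # Stub standing in for Django's Model.objects.create().
    return fields

with open("papers.json", "rb") as f:
    # tqdm wraps the iterator to give a progress indication.
    for record in tqdm(ijson.items(f, "item")):
        venue, _ = get_or_create("Venue", name=record.get("venue"))
        paper = create("Paper", title=record.get("title"), venue=venue)
        for author_name in record.get("authors", []):
            author, _ = get_or_create("Author", name=author_name)
            create("PaperAuthor", paper=paper, author=author)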
On my machine, this consumes practically no memory, as the records are (or would be) dumped into the SQL database, not into an ever-growing, fragmenting dataframe in memory.
The Pandas processing is left as an exercise for the reader ;-), but I'd imagine it'd involve pd.read_sql() to read this preprocessed data from the database.
QUESTION
I'm working with a web response of JSON that looks like this (simplified, and I can't change the format):
...ANSWER
Answered 2021-May-11 at 13:29
You need to use ijson's event interception mechanism. Basically, go one level down in the parsing logic by using ijson.parse until you hit the big array, then switch to using ijson.items with the rest of the parse events. This uses a string literal, but should illustrate the point:
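A small sketch of that pattern; the JSON literal and the "rows" key are placeholders, not the original question's format:

import io
import ijson

data = io.BytesIO(b'{"meta": {"count": 2}, "rows": [{"a": 1}, {"a": 2}]}')

events = ijson.parse(data)
# Walk the low-level events until the big array starts...
for prefix, event, value in events:
    if prefix == "rows" and event == "start_array":
        break

# ...then hand the remaining events to ijson.items to build whole objects.
for row in ijson.items(events, "rows.item"):
    print(row)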
QUESTION
I have a json file just like this:
...ANSWER
Answered 2021-May-02 at 14:59
I think if you need to keep track of CVE IDs and their corresponding CPEs, you'll need to iterate over whole cve items and extract the bits of data you need (so you'll only do one pass through the file). Not as efficient memory-wise as your original iteration, but if each item in CVE_Items is not too big then it's not a problem:
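A minimal sketch of that one-pass iteration; the file name is a placeholder and the nested field names assume the NVD 1.1 feed layout, so adjust them to your actual structure:

import ijson

cve_to_cpes = {}

with open("nvdcve-1.1-2021.json", "rb") as f:
    # One pass: build each whole item under CVE_Items, then pick out the bits.
    for item in ijson.items(f, "CVE_Items.item"):
        cve_id = item["cve"]["CVE_data_meta"]["ID"]
        cpes = [
            match.get("cpe23Uri")
            for node in item.get("configurations", {}).get("nodes", [])
            for match in node.get("cpe_match", [])
        ]
        cve_to_cpes[cve_id] = cpes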
QUESTION
I am parsing an extremely large JSON file using ijson and then writing the contents to a temp file. Afterwards, I overwrite the original file with the contents of the temp file.
...ANSWER
Answered 2021-Feb-17 at 13:39
Have you tried json.dump(row, temp, indent=4)?
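A tiny sketch of how that suggestion could fit into the ijson loop; the file names and the item prefix are placeholders, and use_float avoids the Decimal values that json.dump cannot serialize by default:

import json
import ijson

with open("original.json", "rb") as src, open("temp.json", "w") as temp:
    temp.write("[\n")
    for i, row in enumerate(ijson.items(src, "item", use_float=True)):
        if i:
            temp.write(",\n")
        json.dump(row, temp, indent=4)   # pretty-print each item as suggested
    temp.write("\n]")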
QUESTION
I have a large JSON data file of 3.7 GB. I am going to load the JSON file into a dataframe, delete unused columns, then convert it to CSV and load it into SQL. My RAM is 40 GB. My JSON file structure:
...ANSWER
Answered 2021-Feb-07 at 10:26
Your proposal is:
- Step 1: read the JSON file
- Step 2: load into a dataframe
- Step 3: save the file as a CSV
- Step 4: load the CSV into SQL
- Step 5: load the data into Django to search
The problem with your second example is that you still use global lists (data_phone, data_name), which grow over time.
Here's what you should try, for huge files:
- Step 1: read the JSON
  - line by line
  - do not save any data into a global list
  - write data directly into SQL
- Step 2: add indexes to your database
- Step 3: use SQL from Django
You don't need to write anything to CSV. If you really want to, you could simply write the file line by line:
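A minimal sketch of that line-by-line CSV writing; the file names, the "item" prefix and the "phone"/"name" fields are placeholders:

import csv
import ijson

with open("big.json", "rb") as src, open("out.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    writer.writerow(["phone", "name"])
    # Each record goes straight to disk; no global list (data_phone, data_name)
    # ever grows in memory.
    for record in ijson.items(src, "item"):
        writer.writerow([record.get("phone"), record.get("name")])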
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install ijson
Rust is installed and managed by the rustup tool. Rust has a 6-week rapid release process and supports a great number of platforms, so there are many builds of Rust available at any time. Please refer to rust-lang.org for more information.