disk.frame | Fast Disk-Based Parallelized Data Manipulation Framework
kandi X-RAY | disk.frame Summary
How do I manipulate tabular data that doesn't fit into Random Access Memory (RAM)? In a nutshell, {disk.frame} makes use of two simple ideas: split a larger-than-RAM dataset into chunks, with each chunk stored in a separate file inside a folder, and provide a convenient API to manipulate those chunks. {disk.frame} performs a similar role to distributed systems such as Apache Spark, Python's Dask, and Julia's JuliaDB.jl for medium data: datasets that are too large for RAM but not quite large enough to qualify as big data.
Community Discussions
Trending Discussions on disk.frame
QUESTION
ANSWER
Answered 2022-Mar-28 at 10:14: The function disk.frame() reads in an existing disk.frame folder.
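A minimal sketch of that call (the folder path here is hypothetical):

```r
library(disk.frame)
setup_disk.frame()

# open a reference to an existing disk.frame folder on disk;
# "mydata.df" is a made-up path for illustration
df <- disk.frame("mydata.df")
```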
QUESTION
I'm currently trying to write a function that filters some rows of a disk.frame object using regular expressions. Unfortunately, I run into some issues with the evaluation of my search string in the filter function. My idea was to pass a regular expression as a string into a function argument (e.g. storm_name) and then pass that argument into my filtering call. I used the %like% function included in {data.table} for filtering rows.
My problem is that the storm_name object gets evaluated inside the disk.frame. However, since storm_name is only included in the function environment, but not in the disk.frame object, I get the following error:
ANSWER
Answered 2022-Jan-20 at 17:38: While I don't know the exact cause of this, it has to do with environments, the search path, etc. For instance, these work:
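One workaround (a sketch, not taken from the original answer) is to sidestep the environment lookup entirely by handing the pattern to the per-chunk function as an explicit argument via cmap, assuming cmap forwards extra arguments to the per-chunk function:

```r
library(disk.frame)
library(data.table)
setup_disk.frame()

# hypothetical disk.frame with a `name` column
storms.df <- as.disk.frame(data.frame(name = c("KATRINA", "SANDY", "IKE")))

filter_storms <- function(df, storm_name) {
  # storm_name arrives as a function argument, so the workers never
  # need to find it on their own search path
  collect(cmap(df, function(chunk, pat) chunk[name %like% pat],
               pat = storm_name))
}

filter_storms(storms.df, "^KAT")
```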
QUESTION
According to the article https://diskframe.com/articles/ingesting-data.html, a good use case for inmapfn as part of csv_to_disk.frame(...) is date conversion. In my data I know the name of the date column only at runtime and would like to feed it into a convert-at-read-time function. One issue I am having is that no additional parameters seem to be passable into the inmapfn argument beyond the chunk itself, and I can't hardcode the column name because it isn't known until runtime.
To clarify, the issue is that inmapfn seems to run in its own environment to prevent data races and other parallelisation issues. But I know the variable won't be changed, so I am hoping there is some way to override this, as I can make sure that this is safe.
I know the function I am calling works when called on an arbitrary dataframe.
I have provided a reproducible example below.
ANSWER
Answered 2021-Oct-14 at 16:51: You can experiment with different backend and chunk_reader arguments. For example, if you set the backend to readr, the inmapfn user-defined function will have access to previously defined variables. Furthermore, readr will do column type guessing and will automatically impute Date type columns if it recognizes the string format as a date (in your example data it wouldn't recognize that as a date type, however).
If you don't want to use the readr backend for performance reasons, then I would ask if your example correctly represents your actual scenario? I'm not seeing the need to pass in the date column as a variable in the example you provided.
There is a working solution in the Just-in-time transformation section of the link you provided, and I'm not seeing any added complexities between that example and yours.
If you really need to use the default backend and chunk_reader plan AND you really need to send the inmapfn function a previously defined variable, you can wrap the csv_to_disk.frame call in a wrapper function:
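A sketch of such a wrapper (the file path, output directory, column name, and date format are all hypothetical): defining inmapfn inside the wrapper makes the runtime column name part of the closure's enclosing environment:

```r
library(disk.frame)
setup_disk.frame()

ingest_with_dates <- function(path, outdir, date_col) {
  csv_to_disk.frame(
    path,
    outdir  = outdir,
    inmapfn = function(chunk) {
      # convert the runtime-named column to Date as each chunk is read
      chunk[[date_col]] <- as.Date(as.character(chunk[[date_col]]),
                                   format = "%Y%m%d")
      chunk
    }
  )
}

df <- ingest_with_dates("mydata.csv", "mydata.df", "event_date")
```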
QUESTION
I'm getting this error when trying to import CSVs using this code:
some.df = csv_to_disk.frame(list.files("some/path"))
Error in split_every_nlines(name_in = normalizePath(file, mustWork = TRUE), : Expecting a single string value: [type=character; extent=3].
I got a temporary solution with a for loop that iterated through each of the files and then I rbinded all the disk frames together.
I pulled the code from the ingesting data doc
ANSWER
Answered 2020-Sep-20 at 06:09: This seems to be an error triggered by the bigreadr package. I wonder if you have a way to reproduce the chunks. Or maybe try a different chunk_reader.
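The for-loop workaround the questioner mentions can be sketched like this (the directory is the hypothetical one from the question): ingest each CSV on its own, then bind the resulting disk.frames with rbindlist.disk.frame:

```r
library(disk.frame)
setup_disk.frame()

files <- list.files("some/path", full.names = TRUE)

# ingest each file into its own disk.frame, then bind them together
dfs     <- lapply(files, csv_to_disk.frame)
some.df <- rbindlist.disk.frame(dfs)
```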
QUESTION
I have a disk frame that I've saved into a file. It's made up of ten chunks.
I coded every one of the columns as a character because I intend on combining these individual disk frames into one large disk frame and setting the column types at that point.
I wanted to pull the disk frame from its file with this code
ANSWER
Answered 2020-Sep-19 at 01:32: In case someone gets the same error, it means that you have the wrong pathname.
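A quick sanity check before opening can surface that mistake early (a sketch; the path is hypothetical):

```r
library(disk.frame)

path <- "mychunks.df"  # hypothetical folder name
# fail fast with a clear message instead of a cryptic error downstream
stopifnot(dir.exists(path))
df <- disk.frame(path)
```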
QUESTION
I'm looking through the docs and I don't see a function for writing to CSV.
It appears there's a function for writing the disk frame, but it's unclear what format it gets stored in:
write_disk.frame
Write a data.frame/disk.frame to a disk.frame location. If df is a data.frame then using the as.disk.frame function is recommended for most cases.
Can I use fwrite or write_csv with a disk frame?
ANSWER
Answered 2020-Sep-17 at 02:39: I see. I might add write-to-CSV functionality, as I see this request quite often.
The best way to keep track, though, is to submit an issue on GitHub: https://github.com/xiaodaigh/disk.frame/issues. I have done that this time; see https://github.com/xiaodaigh/disk.frame/issues/311
If you want to write each chunk to a separate CSV just do
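One way to sketch that (not an official CSV-export API; this just walks the chunks with nchunks and get_chunk and writes each one with data.table::fwrite):

```r
library(disk.frame)
library(data.table)
setup_disk.frame()

df <- as.disk.frame(mtcars)  # small illustrative disk.frame

# write each chunk to its own CSV file
for (i in seq_len(nchunks(df))) {
  fwrite(get_chunk(df, i), sprintf("chunk_%02d.csv", i))
}
```

Dropping the sprintf and using a single filename with `append = i > 1` would instead collect all chunks into one CSV.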
QUESTION
I saved a disk frame to its output directory and then restarted my R session.
I'd like to read the existing disk frame instead of recreating it elsewhere.
How might I be able to accomplish this? My folder is called outdir.df
This is how I saved the disk frame
ANSWER
Answered 2020-Sep-11 at 19:37: I think disk.frame's preferred method is to open a reference to the disk location, using the disk.frame() function.
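A minimal sketch, using the outdir.df folder named in the question:

```r
library(disk.frame)
setup_disk.frame()

# after restarting R, re-open the saved disk.frame by pointing
# disk.frame() at its on-disk location
df <- disk.frame("outdir.df")
```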
QUESTION
I'd like to convert a data frame to a disk frame and then count the first column. It's not counting the number of unique values of the column when I try it. It appears to be counting the number of workers.
ANSWER
Answered 2020-Sep-09 at 00:59: {disk.frame} only supports some group-by functions. You can use dplyr::n_distinct.
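A sketch of counting distinct values of a column with n_distinct, on a tiny made-up disk.frame:

```r
library(disk.frame)
library(dplyr)
setup_disk.frame()

df <- as.disk.frame(data.frame(x = c(1, 1, 2, 3)))

# n_distinct is among the group-by functions {disk.frame} supports
df %>%
  summarise(n_unique = n_distinct(x)) %>%
  collect()
```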
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install disk.frame
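A minimal install sketch, assuming the package is available on CRAN (the GitHub repository is the one linked above):

```r
# from CRAN
install.packages("disk.frame")

# or the development version from GitHub
# install.packages("remotes")
# remotes::install_github("xiaodaigh/disk.frame")
```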