PyTables | A Python package to manage extremely large amounts of data | Data Visualization library
kandi X-RAY | PyTables Summary
A Python package to manage extremely large amounts of data
PyTables Key Features
PyTables Examples and Code Snippets
.. _whatsnew_120.duplicate_labels:
Optionally disallow duplicate labels
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
:class:`Series` and :class:`DataFrame` can now be created with ``allows_duplicate_labels=False`` flag to
control whether the index or columns can contain duplicate labels.
.. _whatsnew_0150.cat:
Categoricals in Series/DataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
:class:`~pandas.Categorical` can now be included in ``Series`` and ``DataFrames`` and gained new
methods to manipulate them. Thanks to Jan Schulz for much of this API/implementation.
.. _whatsnew_0240.enhancements.intna:
Optional integer NA support
^^^^^^^^^^^^^^^^^^^^^^^^^^^
pandas has gained the ability to hold integer dtypes with missing values. This long requested feature is enabled through the use of :ref:`extension types`.
df2.columns = df2.columns.str[0]
df3.columns = df3.columns.str[0]
out = pd.concat([df1, df2, df3])
out = pd.concat([df1, df2.rename(columns=lambda x: x[0]), df3.rename(columns=lambda x: x[0])])
out = pd.DataFrame(np.concatenate([df1.values, df2.values, df3.values]), columns=df1.columns)
Out[346]:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
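The first approach above can be made self-contained with invented sample frames (here df2 and df3 are assumed to carry suffixed column names that reduce to their first character):

```python
import pandas as pd

# Hypothetical frames: df2/df3 carry suffixed column names ("A_x", "B_y", ...)
df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
df2 = pd.DataFrame({"A_x": ["A2", "A3"], "B_x": ["B2", "B3"]})
df3 = pd.DataFrame({"A_y": ["A4", "A5"], "B_y": ["B4", "B5"]})

# Normalize column names to their first character, then concatenate
out = pd.concat(
    [df1, df2.rename(columns=lambda c: c[0]), df3.rename(columns=lambda c: c[0])],
    ignore_index=True,  # re-number the rows 0..5 instead of repeating 0..1
)
print(out)
```

Without `ignore_index=True` the original row labels of each frame are kept, which is sometimes what you want for traceability.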
>>> data = "the quick brown fox jumps over the lazy dog"
>>> c = collections.Counter(data[i:i+2] for i in range(len(data)-1))
>>> max(c, key=c.get)
'th'
>>> c = collections.Counter(data[i:i+3] for i in range(len(data)-2))
>>> max(c, key=c.get)
'the'
merged = pd.merge(pos, mat_grp.drop_duplicates('matgrp'), how='left', on='matgrp')
PO matgrp commodity
0 123456 1001 foo - 10001
1 654321 803A spam - 100003
2 971358 803B eggs - 10003
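Reconstructed as a runnable sketch, with sample data invented to match the output shown (the real pos and mat_grp frames are not shown in the question):

```python
import pandas as pd

# Invented data shaped to match the output above; mat_grp contains a
# duplicated matgrp row to show why drop_duplicates is needed
pos = pd.DataFrame({"PO": [123456, 654321, 971358],
                    "matgrp": ["1001", "803A", "803B"]})
mat_grp = pd.DataFrame({"matgrp": ["1001", "1001", "803A", "803B"],
                        "commodity": ["foo - 10001", "foo - 10001",
                                      "spam - 100003", "eggs - 10003"]})

# Drop duplicate matgrp rows first so the left merge stays one row per PO
merged = pd.merge(pos, mat_grp.drop_duplicates("matgrp"), how="left", on="matgrp")
print(merged)
```

Merging against the un-deduplicated frame would instead produce one output row per matching right-hand row, inflating the result.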
SELECT oid, relnamespace::regnamespace AS schema, relname, relkind
FROM pg_class
WHERE relname ILIKE 'contacts';
REINDEX TABLE contacts;
REINDEX INDEX contacts_pkey; -- use actual name
e = Event()
...
e.nation = Nation.objects.get(name=nation_list[i])
e.save()
df = df.assign(g=df.groupby('personal id').cumcount()).pivot(index='personal id', columns='g', values='label')
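A minimal runnable sketch of the cumcount-plus-pivot idiom above, with invented data:

```python
import pandas as pd

# Illustrative data: repeated 'personal id' values, one label per occurrence
df = pd.DataFrame({"personal id": [1, 1, 2], "label": ["a", "b", "c"]})

# Number each occurrence within an id, then spread the labels into columns;
# ids with fewer occurrences get NaN in the trailing columns
wide = (df.assign(g=df.groupby("personal id").cumcount())
          .pivot(index="personal id", columns="g", values="label"))
print(wide)
```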
Community Discussions
Trending Discussions on PyTables
QUESTION
I have a table in pytables created as follows:
...ANSWER
Answered 2021-May-20 at 16:48 Yes, that behavior is expected. Take a look at this answer to see a more detailed example of the same behavior: How does HDF handle the space freed by deleted datasets without repacking. Note that the space will be reclaimed/reused if you add new datasets.
To reclaim the unused space in the file, you have to use a command-line utility. There are two choices: ptrepack and h5repack. Both are used for a number of external file operations. To reduce file size after object deletion, create a new file from the old one as shown below:
- ptrepack: utility delivered with PyTables. Reference here: PyTables ptrepack doc. Example: ptrepack file1.h5 file2.h5 (creates file2.h5 from file1.h5)
- h5repack: utility from The HDF Group. Reference here: HDF5 h5repack doc. Example: h5repack [OPTIONS] file1.h5 file2.h5 (creates file2.h5 from file1.h5)
Both have options to use a different compression method when creating the new file, so they are also handy if you want to convert from compressed to uncompressed (or vice versa).
QUESTION
I would like to get the byte contents of a pandas dataframe exported as hdf5, ideally without actually saving the file (i.e., in-memory).
On python>=3.6,<3.9 (and pandas==1.2.4, pytables==3.6.1) the following used to work:
ANSWER
Answered 2021-May-18 at 12:59 The fix was to do conda install -c conda-forge pytables instead of pip install pytables. I still don't understand the ultimate reason behind the error, though.
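As for the in-memory goal itself: HDF5's core driver can keep the whole file in memory and hand back its bytes, so nothing touches disk. A sketch, assuming pandas backed by PyTables; note that store._handle (the underlying tables.File) is a private attribute:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# The H5FD_CORE driver keeps the file in memory; driver_core_backing_store=0
# means nothing is written to disk. get_file_image() returns the raw bytes.
with pd.HDFStore("in_memory.h5", mode="w", driver="H5FD_CORE",
                 driver_core_backing_store=0) as store:
    store.put("df", df)
    image = store._handle.get_file_image()  # must be called while file is open

print(len(image))
```

The filename is only a label here; with the backing store disabled, no file of that name is created.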
QUESTION
I would like to store a pandas DataFrame such that when I later load it again, I only load certain columns of it and not the entire thing. Therefore, I am trying to store a pandas DataFrame in hdf format. The DataFrame contains a numpy array and I get the following error message.
Any idea on how to get rid of the error or what format I could use instead?
CODE:
...ANSWER
Answered 2021-Apr-28 at 22:20 Pandas seems to have trouble serializing the numpy array in your dataframe. So I would suggest storing the numpy data in a separate *.h5 file.
QUESTION
I have a problem with updating packages in conda. The list of my installed packages is:
...ANSWER
Answered 2021-Apr-14 at 20:26 Channel pypi means that the package was installed with pip. You may need to upgrade it with pip as well.
QUESTION
All of my virtual environments work fine, except for one in which the jupyter notebook won't connect for kernel. This environment has Zipline in it, so I expect there is some dependency that is a problem there, even though I installed all packages with Conda.
I've read the question and answers here, and unfortunately downgrading tornado to 5.1.1 didn't work nor do I get ValueErrors. I am, however, getting an AssertionError that appears related to the Class NSProcessInfo.
I'm on an M1 Mac. Log from terminal showing the error below, and my environment file is below that. Can someone help me get this kernel working? Thank you!
...ANSWER
Answered 2021-Apr-04 at 18:14 Figured it out.
What works:
QUESTION
Created a new compute instance in Azure ML and trained a model without any issue. I wanted to draw a pairplot using seaborn
but I keep getting the error "ImportError: No module named seaborn"
I ran !conda list
and I can see seaborn in the list
ANSWER
Answered 2020-Sep-07 at 04:17 I just did the following and wasn't able to reproduce your error:
- make a new compute instance
- open it up using JupyterLab
- open a new terminal
conda activate azureml_py36
conda install seaborn -y
- open a new notebook and run
import seaborn as sns
- Are you using the kernel Python 3.6 - AzureML (i.e. the azureml_py36 conda env)?
- Have you tried restarting the kernel and/or creating a new compute instance?
QUESTION
I found that ~True is -2 & ~False is -1 using my Jupyter Notebook.
This source says that ~ inverts all the bits. Why isn't ~True False and ~False True?
My attempt to explain these:
True is +1, and the bits of +1 are inverted. + is inverted to -. 1 in two-digit binary is 01, so inverted bits: 10, i.e. 2. So the result is -2.
False is +0, + is inverted to -, 0 in two-digit binary is 00, all the bits inverted: 11, which is 3 - it should be 1.
This answer paints a more complicated picture:
A list full of Trues only contains 4- or 8-byte references to the one canonical True object.
This source says:
bool: Boolean (true/false) types. Supported precisions: 8 (default) bits.
These don't support the simplistic (and apparently wrong) reasoning above.
The question: What is the proper explanation for ~True being -2 & ~False being -1 then?
ANSWER
Answered 2020-Aug-19 at 10:22 First of all, I'd use the not operator to invert Boolean values (not True == False, and vice versa). Now if Booleans are stored as 8-bit integers, the following happens:
True is 0000 0001. Hence ~True yields 1111 1110, which is -2 in two's-complement representation.
False is 0000 0000. Hence ~False yields 1111 1111, which is -1.
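The answer above can be demonstrated directly (a quick sketch; masking with `& 0xFF` just exposes the low 8 bits):

```python
# Booleans are a subclass of int, so ~ applies integer bitwise NOT
# (two's complement: ~x == -x - 1), while `not` performs logical negation.
print(~True, ~False)        # -2 -1
print(not True, not False)  # False True

# Viewing only the low 8 bits makes the inversion visible:
print(format(True & 0xFF, "08b"))   # 00000001
print(format(~True & 0xFF, "08b"))  # 11111110
```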
QUESTION
I'm going to use rasterio in python. I downloaded rasterio via
...ANSWER
Answered 2020-Sep-22 at 05:37 I've got some experience with rasterio, but I am not nearly a master with it. If I remember correctly, rasterio requires you to have installed the program GDAL (both binaries and python utilities), and some other dependencies listed on the PyPi page. I don't use conda at the moment, I like to use the regular python 3.8 installer with pip. Given what I'm seeing with your installation, I would uninstall rasterio and follow a different installation procedure.
I follow the instructions listed here: https://rasterio.readthedocs.io/en/latest/installation.html
This page also has separate instructions for those using Anaconda.
The GDAL installation is by far the most annoying but once it's done, the hard part is over. The python utilities for both rasterio and gdal can be found here:
https://www.lfd.uci.edu/~gohlke/pythonlibs/#gdal
The second link is also provided on the PyPi page but I like to keep it bookmarked because there's a lot of good resources there!
QUESTION
I'm trying to create a PyTables table to store 200000 * 200000 matrix in it. I try this code:
...ANSWER
Answered 2020-Aug-30 at 22:38 That's a big matrix (300GB if all ints). Likely you will have to write incrementally. (I don't have enough RAM on my system to do it all at once.)
Without seeing your data types, it's hard to give specific advice.
First question: do you really want to create a Table or will an Array suffice?
PyTables has both types. What's the difference?
An Array holds homogeneous data (like a NumPy ndarray) and can have any dimension.
A Table is typically used to hold heterogeneous data (like a NumPy recarray) and is always 2d (really a 1d array of structured types). Tables also support complex queries with the PyTables API.
The key when creating a Table is to use either the description= or obj= parameter to describe the structured types (and field names) for each row. I recently posted an answer that shows how to create a Table. Please review. You may find you don't want to create 200000 fields/columns to define the Table. See this answer: different data types for different columns of an array
If you just want to save a matrix of 200000x200000 homogeneous entities, an array is easier. (Given the data size, you probably need to use an EArray, so you can write the data in increments.) I wrote a simple example that creates an EArray with 2000x200000 entities, then adds 3 more sets of data (each 2000 rows; total of 8000 rows).
- The shape=(0, nrows) parameter indicates the first axis can be extended, and creates ncols columns.
- The expectedrows=nrows parameter is important in large datasets to improve I/O performance.
The resulting HDF5 file is 6GB. Repeat earr.append(arr) 99 times to get 200000 rows. Code below:
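The code referenced above was not captured on this page; a scaled-down sketch of the described approach (an EArray extended in row chunks; the Int32 dtype and the tiny dimensions are assumptions so the example runs quickly) might look like:

```python
import numpy as np
import tables as tb

# Scaled-down stand-ins: the answer used 2000-row chunks and 200000 columns
chunk_rows, ncols, nrows = 4, 10, 16

with tb.open_file("big_matrix.h5", mode="w") as h5f:
    # shape=(0, ncols): first axis is extendable, with ncols columns per row
    earr = h5f.create_earray(h5f.root, "earr",
                             atom=tb.Int32Atom(),
                             shape=(0, ncols),
                             expectedrows=nrows)
    arr = np.arange(chunk_rows * ncols, dtype=np.int32).reshape(chunk_rows, ncols)
    for _ in range(nrows // chunk_rows):  # append incrementally, chunk by chunk
        earr.append(arr)
    final_shape = earr.shape
    print(final_shape)  # (16, 10)
```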
QUESTION
In Python, I'm reading in a very large 2D grid of data that consists of around 200,000,000 data points in total. Each data point is a tuple of 3 floats. Reading all of this data into a two-dimensional list frequently causes Memory Errors. To get around this, I would like to be able to read the data into some sort of table on the hard drive that can be efficiently accessed when given a grid coordinate, i.e. harddrive_table.get(300, 42).
So far in my research, I've come across PyTables, which is an implementation of HDF5 and seems like overkill, and the built in shelve library, which uses a dictionary-like method to access saved data, but the keys have to be strings and the performance of converting hundreds of millions of grid coordinates to strings for storage could be too much of a performance hit for my use.
Are there any libraries that allow me to store a 2D table of data on the hard drive with efficient access for a single data point?
This table of data is only needed while the program is running, so I don't care about its interoperability or how it stores the data on the hard drive, as it will be deleted after the program has run.
...ANSWER
Answered 2020-Aug-26 at 03:22 HDF5 isn't really overkill if it works. In addition to PyTables there's the somewhat simpler h5py.
Numpy lets you mmap a file directly into a numpy array. The values will be stored in the disk file in the minimum-overhead way, with the numpy array shape providing the mapping between array indices and file offsets. mmap uses the same underlying OS mechanisms that power the disk cache to map a disk file into virtual memory, meaning that the whole thing can be loaded into RAM if memory permits, but parts can be flushed to disk (and reloaded later on demand) if it doesn't all fit at once.
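A minimal sketch of the memmap approach, with small invented dimensions (the float32 dtype and (rows, cols, 3) layout are assumptions matching "a tuple of 3 floats per grid point"):

```python
import os
import tempfile

import numpy as np

# Small stand-in dimensions; the question's grid was far larger
rows, cols = 100, 100
path = os.path.join(tempfile.mkdtemp(), "grid.dat")

# Create a disk-backed array: indices map directly to file offsets
grid = np.memmap(path, dtype=np.float32, mode="w+", shape=(rows, cols, 3))
grid[30, 42] = (1.0, 2.0, 3.0)   # write a single data point
grid.flush()

# Reopen read-only and fetch one point without loading the whole grid
grid_ro = np.memmap(path, dtype=np.float32, mode="r", shape=(rows, cols, 3))
print(grid_ro[30, 42])  # [1. 2. 3.]
```

The OS page cache decides which parts live in RAM, so lookups like `grid_ro[300, 42]` stay cheap even when the file is much bigger than memory.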
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Install PyTables
You can use PyTables like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.