datasets | largest hub of ready-to-use datasets | Dataset library

 by huggingface | Python | Version: 2.14.5 | License: Apache-2.0

kandi X-RAY | datasets Summary

datasets is a Python library typically used in Artificial Intelligence, Dataset, Deep Learning, PyTorch, and NumPy applications. It has no reported bugs or vulnerabilities, a build file is available, it has a Permissive License, and it has medium support. You can install it with 'pip install datasets' or download it from GitHub or PyPI.

Datasets also provides access to 15+ evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics.

            Support

              datasets has a moderately active ecosystem.
              It has 16438 stars and 2215 forks. There are 271 watchers for this library.
              There were 10 major releases in the last 6 months.
              There are 552 open issues and 1710 have been closed. On average, issues are closed in 11 days. There are 77 open pull requests and 0 closed pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of datasets is 2.14.5.

            Quality

              datasets has 0 bugs and 0 code smells.

            Security

              datasets has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              datasets code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            License

              datasets is licensed under the Apache-2.0 License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            Reuse

              datasets releases are available to install and integrate.
              Deployable package is available in PyPI.
              Build file is available. You can build the component from source.
              Installation instructions, examples and code snippets are available.
              datasets saves you 84069 person hours of effort in developing the same functionality from scratch.
              It has 147459 lines of code, 5688 functions and 964 files.
              It has medium code complexity. Code complexity directly impacts maintainability of the code.

            Top functions reviewed by kandi - BETA

            kandi has reviewed datasets and identified the functions below as its top functions. The list is intended to give you a quick view of the functionality datasets implements and to help you decide whether it suits your requirements.
            • Download and prepare files
            • Download and prepare data for all splits
            • Check if manual data is required
            • Check if the filesystem is a remote file system
            • Push shards to hub
            • Create a repository
            • Sharded dataset
            • Push parquet shards to hub
            • Add a FAISS index
            • Align the labels with the given mapping
            • Sort the Dataset
            • Return a Dataset based on a function
            • Run the builder
            • Shuffle dataset
            • Renames a column
            • Renames columns
            • Returns an iterator over the examples in the dataset
            • Sort dataset by column
            • Add an elasticsearch index
            • Build a single dataset
            • Return a YAML representation of the feature
            • Encodes a column
            • Shuffle the dataset
            • Save the dataset to disk
            • Return a new Dataset with the given function
            • Runs the tool
            Get all kandi verified functions for this library.

            datasets Key Features

            No Key Features are available at this moment for datasets.

            datasets Examples and Code Snippets

            'images': [
                    'file_name': 'COCO_val2014_000000001268.jpg',
                    'height': 427,
                    'width': 640,
                    'id': 1268
            'annotations': [
                    'segmentation': [[192.81,

            1280 720
            10 20 40 60 1
            20 40 50 60 2
            1280 720
            50 20 40 60 2
            20 40 30 45 2
            30 40 50 60 3

            import mmcv
            import numpy as np
            from .builder import DATASETS
            from .custom import CustomDataset

            ├── mmdet
            ├── tools
            ├── configs
            ├── data
            │   ├── coco
            │   │   ├── annotations
            │   │   ├── train2017
            │   │   ├── val2017
            │   │   ├── test2017
            │   ├── cityscapes
            │   │   ├── annotations
            │   │   ├── leftImg8bit
            │   │   │   ├── train
            │   │     

            Distribute datasets from a function.
            Python | Lines of Code: 78 | License: Non-SPDX (Apache License 2.0)

            def distribute_datasets_from_function(self, dataset_fn, options=None):
                # pylint: disable=line-too-long
                """Distributes `` instances created by calls to `dataset_fn`.
                The argument `dataset_fn` that users pass in is an input   

            Creates a list of Datasets from a function.
            Python | Lines of Code: 65 | License: Non-SPDX (Apache License 2.0)

            def get_distributed_datasets_from_function(dataset_fn,

            Sample from datasets.
            Python | Lines of Code: 61 | License: Non-SPDX (Apache License 2.0)

            def sample_from_datasets_v2(datasets,
              """Samples elements at random from the datasets in `datasets`.

            ner = pipeline("ner", aggregation_strategy="simple", model="dbmdz/bert-large-cased-finetuned-conll03-english")  # Named Entity Recognition (NER)

            Pandas Dataframe display total
            Python | Lines of Code: 2 | License: Strong Copyleft (CC BY-SA 4.0)

            df1['Total'] = df1.groupby('State')['Product'].transform('count')

            Get all images of a multi-frame DICOM file
            Python | Lines of Code: 6 | License: Strong Copyleft (CC BY-SA 4.0)

            for i, slice in enumerate(ds.pixel_array):

            How to avail "Forecasting: Methods and Application" dataset in Python?
            Python | Lines of Code: 32 | License: Strong Copyleft (CC BY-SA 4.0)

            ## install and load package:
            ## list example data of package fma:
            data(package = 'fma')
            ## export single data as csv:
            write.csv(cement, file = 'cement.csv')
            ## bulk export:
            ## data names are in `[,

            Community Discussions


            Shap - The color bar is not displayed in the summary plot
            Asked 2022-Apr-05 at 00:40

            When displaying summary_plot, the color bar does not show.



            Answered 2021-Dec-26 at 21:17

            I had the same problem, and I found that the solution was to downgrade matplotlib to 3.4.3. It appears SHAP isn't optimized for matplotlib 3.5.1 yet.



            react-chartjs-2 with chartJs 3: Error "arc" is not a registered element
            Asked 2022-Mar-09 at 11:20

            I am working on a React app where I want to display charts. I tried to use react-chartjs-2 but I can't find a way to make it work. When I try to use the Pie component, I get the error: Error: "arc" is not a registered element.

            I created a very simple React app:

            • npx create-react-app my-app
            • npm install --save react-chartjs-2 chart.js

            Here is my package.json:



            Answered 2021-Nov-24 at 15:13

            Chart.js has been tree-shakable since V3, so you need to import and register all the elements you are using.



            AttributeError: Can't get attribute 'new_block' on
            Asked 2022-Feb-25 at 13:18

            I was using pyspark on AWS EMR (4 r5.xlarge as 4 workers, each has one executor and 4 cores), and I got AttributeError: Can't get attribute 'new_block' on . Below is a snippet of the code that threw this error:



            Answered 2021-Aug-26 at 14:53

            I had the same error with pandas 1.3.2 on the server and 1.2 on my client. Downgrading pandas to 1.2 solved the problem.



            Tensorflow setup on RStudio/ R | CentOS
            Asked 2022-Feb-11 at 09:36

            For the last 5 days, I have been trying to make the Keras/TensorFlow packages work in R. I am using RStudio for installation and have tried conda, miniconda, and virtualenv, but it crashes each time in the end. Installing a library should not be a nightmare, especially when we are talking about R (one of the best statistical languages) and TensorFlow (one of the best deep learning libraries). Can someone share a reliable way to install Keras/TensorFlow on CentOS 7?

            Following are the steps I am using to install tensorflow in RStudio.

            Since RStudio simply crashes each time I run tensorflow::tf_config() I have no way to check what is going wrong.



            Answered 2022-Jan-16 at 00:08

            Perhaps my failed attempts will help someone else solve this problem; my approach:

            • boot up a clean CentOS 7 vm
            • install R and some dependencies



            Configuring compilers on Mac M1 (Big Sur, Monterey) for Rcpp and other tools
            Asked 2022-Feb-10 at 21:07

            I'm trying to use packages that require Rcpp in R on my M1 Mac, which I was never able to get up and running after purchasing this computer. I updated it to Monterey in the hope that this would fix some installation issues but it hasn't. I tried running the Rcpp check from this page but I get the following error:



            Answered 2022-Feb-10 at 21:07

            Currently (2022-02-05), CRAN builds R binaries for Apple silicon using Apple clang (from Command Line Tools for Xcode 12.4) and an experimental build of gfortran.

            If you obtain R from CRAN (i.e., here), then you need to replicate CRAN's compiler setup on your system before building R packages that contain C/C++/Fortran code from their sources (and before using Rcpp, etc.). This requirement ensures that your package builds are compatible with R itself.

            A further complication is the fact that Apple clang doesn't support OpenMP, so you need to do even more work to compile programs that make use of multithreading. You could circumvent the issue by building R itself and all R packages from sources with LLVM clang, which does support OpenMP, but this approach is onerous and "for experts only". There is another approach that has been tested by a few people, including Simon Urbanek, the maintainer of R for macOS. It is experimental and also "for experts only", but seems to work on my machine and is simpler than trying to build R yourself.

            Instructions for obtaining a working toolchain

            Warning: These instructions come with no warranty and could break at any time. They assume some level of familiarity with C/C++/Fortran program compilation, Makefile syntax, and Unix shells. As usual, sudo at your own risk.

            I will try to address compilers and OpenMP support at the same time. I am going to assume that you are starting from nothing. Feel free to skip steps you've already taken, though you might find a fresh start helpful.

            I've tested these instructions on a machine running Big Sur, and at least one person has tested them on a machine running Monterey. I would be glad to hear from others.

            1. Download an R binary from CRAN here and install. Be sure to select the binary built for Apple silicon.

            2. Run



            How to automate legends for a new geom in ggplot2?
            Asked 2022-Jan-30 at 18:08

            I've built a new ggplot2 geom layer, which I'm calling geom_triangles, that plots isosceles triangles given aesthetics including x, y, z, where z is the height of the triangle and the base of the isosceles triangle has midpoint (x, y) on the graph.

            What I want is for the geom_triangles() layer to automatically provide legend components for the height and width of the triangles, but I am not sure how to do that.

            I understand based on this reference that I may need to adjust the draw_key argument in the ggproto StatTriangles object, but I'm not sure how I would do that and can't seem to find examples online of how to do it. I've been looking at the source code in ggplot2 for the draw_key functions, but I'm not sure how I would introduce multiple legend components (one for each of height and width) in a single draw_key argument in the StatTriangles ggproto.



            Answered 2022-Jan-30 at 18:08

            I think you might be slightly overcomplicating things. Ideally, you'd just want a single key drawing method for the whole layer. However, because you're using a Stat to do the majority of calculations, this becomes hairy to implement. In my answer, I'm avoiding this.

            Let's say I want to use a geom-only implementation of such a layer. I can make the following (simplified) class/constructor pair. Below, I haven't bothered with width_scale or height_scale parameters, just for simplicity.




            Is it possible to use a collection of hyperspectral 1x1 pixels in a CNN model purposed for more conventional datasets (CIFAR-10/MNIST)?
            Asked 2021-Dec-17 at 09:08

            I have created a working CNN model in Keras/Tensorflow, and have successfully used the CIFAR-10 & MNIST datasets to test this model. The functioning code as seen below:



            Answered 2021-Dec-16 at 10:18

            If the hyperspectral dataset is given to you as a large image with many channels, I suppose that the classification of each pixel should depend on the pixels around it (otherwise I would not format the data as an image, i.e. without grid structure). Given this assumption, breaking up the input picture into 1x1 parts is not a good idea, as you are losing the grid structure.

            I further suppose that the order of the channels is arbitrary, which implies that convolution over the channels is probably not meaningful (which, however, you did not plan to do anyway).

            Instead of reformatting the data the way you did, you may want to create a model that takes an image as input and also outputs an "image" containing the classifications for each pixel. I.e. if you have 10 classes and take a (145, 145, 200) image as input, your model would output a (145, 145, 10) image. In that architecture you would not have any fully-connected layers. Your output layer would also be a convolutional layer.

            That, however, means that you will not be able to keep your current architecture, because the tasks for MNIST/CIFAR10 and your hyperspectral dataset are not the same. For MNIST/CIFAR10 you want to classify an image in its entirety, while for the other dataset you want to assign a class to each pixel (while most likely also using the pixels around each pixel).

            Some further ideas:

            • If you want to turn the pixel classification task on the hyperspectral dataset into a classification task for an entire image, maybe you can reformulate that task as "classifying a hyperspectral image as the class of its center (or top-left, or bottom-right, or (21st, 104th), or whatever) pixel". To obtain the data from your single hyperspectral image, for each pixel, I would shift the image such that the target pixel is at the desired location (e.g. the center). All pixels that "fall off" the border could be inserted at the other side of the image.
            • If you want to stick with a pixel classification task but need more data, maybe split up the single hyperspectral image you have into many smaller images (e.g. 10x10x200). You may even want to use images of many different sizes. If your model only has convolution and pooling layers and you make sure to maintain the sizes of the image, that should work out.
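
The two ideas above can be sketched with NumPy. This is an illustrative sketch only, not code from the answer: the helper names `center_on_pixel` and `split_into_patches` are hypothetical, the image is random data with the (145, 145, 200) shape mentioned earlier, and the wrap-around shift is implemented with `np.roll`.

```python
import numpy as np


def center_on_pixel(image, row, col):
    """Circularly shift an (H, W, C) image so pixel (row, col) lands at the center.

    Pixels that "fall off" one border re-enter at the opposite side.
    """
    height, width = image.shape[:2]
    return np.roll(image, shift=(height // 2 - row, width // 2 - col), axis=(0, 1))


def split_into_patches(image, size):
    """Cut non-overlapping (size, size, C) patches, dropping partial border tiles."""
    height, width, channels = image.shape
    rows, cols = height // size, width // size
    trimmed = image[: rows * size, : cols * size]
    return (
        trimmed.reshape(rows, size, cols, size, channels)
        .swapaxes(1, 2)
        .reshape(rows * cols, size, size, channels)
    )


# Hypothetical stand-in for a single hyperspectral image.
rng = np.random.default_rng(0)
image = rng.random((145, 145, 200), dtype=np.float32)

shifted = center_on_pixel(image, row=21, col=104)
patches = split_into_patches(image, size=10)

print(shifted.shape)   # same shape as the input image
print(patches.shape)   # (number of patches, 10, 10, 200)
```

With a 145x145 image and 10x10 patches, the last 5 rows and columns are dropped; overlapping or variable-size patches would need a different slicing scheme.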



            Draw a horizontal and vertical line on mouse hover in chart js
            Asked 2021-Dec-08 at 12:29

            I am stuck on a problem with Chart.js while creating a line chart. I want to create a chart with the specified data and also need horizontal and vertical lines when I hover over an intersection point. I am able to create a vertical line on hover but cannot find a solution that draws both lines. Here is my code to draw the vertical line on hover.



            Answered 2021-Dec-06 at 04:46

            I have done exactly this (but vertical line only) in a previous version of one of my projects. Unfortunately, this feature has been removed, but the older source code file can still be accessed via my GitHub.

            The key is this section of the code:



            react-chartjs-2 fill property not working?
            Asked 2021-Dec-07 at 09:30

            I want to add fill to a line chart using the react-chartjs-2 package. I'm passing fill: true to the dataset but that doesn't work as expected. Any suggestions?



            Answered 2021-Dec-07 at 09:30

            This is because you are using tree shaking and not importing/registering the Filler plugin.



            "Back engineering" an R package from compiled binary version
            Asked 2021-Nov-23 at 21:17

            I work for an org that has a number of internal packages that were created many years ago. These are in the form of package zip archives that were compiled on Windows on R 3.x. Therefore, they can't be installed on R 4.x, and can't be used on Macs or Linux either without being recompiled. So everyone in the entire org is stuck on R 3.6 until this is resolved. I don't have access to the original package source files. They are lost to time....

            I want to take these packages, extract the code and data, and update them for modern best practices (roxygen, GitHub repos, testthat etc.). What is the best way of doing this? I have a fair amount of experience with package development. I have already tackled one. I started a new RStudio package project and went function by function, copying the function code to a new script file, then getting and reformatting the help from the help browser as roxygen docs. I've done the same for any internal hidden functions that I could find (mostly via pkg_name:::), and also the internal datasets. That is all fairly straightforward, but very time consuming. It builds OK, but I haven't yet tested the actual functionality of the code.

            I'm currently stuck because there are a couple of standardGeneric method functions for custom S4 class objects. I am completely unfamiliar with these and haven't been able to figure out how to copy them over. Viewing the source code they are wrapped in new() with "standardGeneric" as the first argument (plus a lot more obviously), as opposed to just being a simple function definition for all the other functions. Any help with how to recreate or copy these over would be very welcome.

            But maybe I am going about this the wrong way in the first place. I haven't been able to find any helpful suggestions about how to "back engineer" R package source files from a compiled version.

            Anyone any ideas?



            Answered 2021-Nov-15 at 15:23

            Check out if this works in R 3.6.

            The script below can automate at least part of your problem by writing all function sources into separate and appropriately named .R files. This code will also take care of hidden functions.

            Extracting code


            Community Discussions, Code Snippets contain sources that include Stack Exchange Network


            No vulnerabilities reported

            Install datasets

            If you plan to use 🤗 Datasets with PyTorch (1.0+), TensorFlow (2.2+) or pandas, you should also install PyTorch, TensorFlow or pandas. For more details on using the library with NumPy, pandas, PyTorch or TensorFlow, check the quick start page in the documentation.


            For new features, suggestions, and bug reports, create an issue on GitHub. If you have questions, check and ask on the Stack Overflow community page.
          • PyPI

            pip install datasets

          • Clone (GitHub CLI)

            gh repo clone huggingface/datasets
