split-folders | Split folders with files | Machine Learning library
kandi X-RAY | split-folders Summary
Split folders with files (e.g. images) into train, validation and test (dataset) folders.
Top functions reviewed by kandi - BETA
- Parse command line arguments
- Copy files from a fixed directory
- Group files by prefix
- Copy the files contained in the input directory
- Split the class directory with fixed parameters
- Copy files_type to output directory
- Return a list of training and validation files
- Checks input folder
- Return a list of the files in the class directory
- Splits the class directory with the given ratio
- List all files in a directory
- List all directories in a directory
Community Discussions
Trending Discussions on split-folders
QUESTION
I'm trying to split my image dataset into a training set and a validation set. I found a Python library called split-folders. The syntax is easy to understand:
splitfolders.ratio("input_folder", output="output", seed=1337, ratio=(.8, .1, .1), group_prefix=None)
But I don't understand the seed parameter or what it does. The description on the page only says that "a seed makes splits reproducible" and that "it shuffles the items", which doesn't really explain anything to me. I have searched for it and found no clear answer. Can anyone give me a brief explanation?
The default number is 1337, but why? What does it mean to have the seed set to 1337? How did they come up with that number? How do I find the correct seed for my dataset?
ANSWER
Answered 2021-May-19 at 05:34
When you split your corpus into training, validation, and test sets, each data point is randomly assigned to one of these three sets. A seed makes that randomness repeatable.
Imagine a random number generator as a black box that gives you a series of random numbers. For any given seed, the sequence it generates is always identical. For example, with seed=1337, a random generator will always produce the same sequence of numbers, something like 12, 901, 110, 1, ..., on the same computer.
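A minimal sketch of this behaviour using Python's standard `random` module. The helper name `first_numbers` and the range 0 to 999 are illustrative assumptions, not part of split-folders:

```python
import random

def first_numbers(seed, count=5):
    """Return the first `count` integers drawn from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator; leaves global random state alone
    return [rng.randint(0, 999) for _ in range(count)]

run_a = first_numbers(1337)
run_b = first_numbers(1337)  # same seed as run_a
run_c = first_numbers(42)    # a different seed

print(run_a == run_b)  # True: the same seed reproduces the same sequence
print(run_a)
print(run_c)
```

The two runs seeded with 1337 are identical, while the run seeded with 42 draws its own, equally repeatable, sequence.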
Why do we care about controlling the randomness, especially when dividing a corpus for training? Because most of the time you want to repeat the same experiment with the same data. If you do not fix the seed, each run of the experiment will end up with a different split of the data.
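To make this concrete, here is a hedged sketch of a seeded three-way split written with the standard library only. It is not the internal implementation of split-folders; the function `split_items` and the file names are hypothetical, and the ratio and default seed simply mirror the values from the question:

```python
import random

def split_items(items, ratio=(0.8, 0.1, 0.1), seed=1337):
    """Shuffle a sorted copy of `items` with a seeded generator and cut it by `ratio`."""
    ordered = sorted(items)               # fixed starting order, so only the seed matters
    random.Random(seed).shuffle(ordered)  # seeded, repeatable shuffle
    n_train = int(len(ordered) * ratio[0])
    n_val = int(len(ordered) * ratio[1])
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]      # whatever remains goes to the test set
    return train, val, test

files = [f"img_{i:03d}.jpg" for i in range(20)]  # hypothetical file names

first = split_items(files, seed=1337)
second = split_items(files, seed=1337)  # repeat of the same experiment
other = split_items(files, seed=7)      # a different, equally valid split

print(first == second)                  # True: same seed, same split
print([len(part) for part in first])    # [16, 2, 2]
```

Rerunning with the same seed yields the same train, validation, and test membership, so results from separate runs can be compared on identical data.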
The seed value itself is not important, as long as it stays fixed throughout your experiments. There is no "correct" seed for a dataset, and 1337 is just an arbitrary default. I personally set it to a prime number.
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install split-folders