cloud-annotations | A fast, easy and collaborative open source image | Data Labeling library
Trending Discussions on Data Labeling
QUESTION
I'm trying to label data in a table so that the index is repeated in each row, but each column uses a different Enum class.
What I've done so far is build this representation with a single Enum class.
A solution that handles each column separately as a list would also be fine. What would be the best way to solve this?
import pandas as pd
from enum import Enum

df = pd.DataFrame({'first': ['product and other', 'product2 and other', 'price'], 'second': ['product and prices', 'price2', 'product3 and price']})
df

class Tipos(Enum):
    B = 1
    I = 2
    L = 3

for index, row in df.iterrows():
    sentencas = row.values
    for sentenca in sentencas:
        for pos, palavra in enumerate(sentenca.split()):
            print(f"{palavra} {Tipos(pos+1).name}")
Results:
                first              second
0   product and other  product and prices
1  product2 and other              price2
2               price  product3 and price
product B
and I
other L
product B
and I
prices L
product2 B
and I
other L
price2 B
price B
product3 B
and I
price L
Desired Results:
Word Ent
0 product B_first
1 and I_first
2 other L_first
3 product B_second
4 and I_second
5 prices L_second
6 product2 B_first
7 and I_first
8 other L_first
9 price2 B_second
10 price B_first
11 product3 B_second
12 and I_second
13 price L_second
# In this case the sequence is (B_first, I_first, L_first, L_first...), and when the column changes it becomes B_second, I_second, L_second...
ANSWER
Answered 2021-Dec-30 at 13:57
Instead of using Enum you can use a dict mapping. You can avoid loops if you flatten your dataframe:
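# Map word positions to tags with a plain dict (replaces the Enum from the question)
Tipos = {0: 'B', 1: 'I', 2: 'L'}
out = df.unstack().str.split().explode().sort_index(level=1).to_frame('Word')
out['Ent'] = out.groupby(level=[0, 1]).cumcount().map(Tipos) \
             + '_' + out.index.get_level_values(0)
out = out.reset_index(drop=True)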
Output:
>>> out
Word Ent
0 product B_first
1 and I_first
2 other L_first
3 product B_second
4 and I_second
5 prices L_second
6 product2 B_first
7 and I_first
8 other L_first
9 price2 B_second
10 price B_first
11 product3 B_second
12 and I_second
13 price L_second
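For comparison, the same table can be built with an explicit loop, which may be easier to follow than the flattened version. This is only a sketch: the names `tags`, `records` and `out_loop` are made up for illustration, and it assumes no cell holds more than three words.

```python
import pandas as pd

df = pd.DataFrame({
    'first': ['product and other', 'product2 and other', 'price'],
    'second': ['product and prices', 'price2', 'product3 and price'],
})

# Position-to-tag mapping (illustrative name, not from the original answer)
tags = {0: 'B', 1: 'I', 2: 'L'}

records = []
for _, row in df.iterrows():
    # row.items() yields (column name, cell text) pairs in column order
    for column, sentence in row.items():
        for pos, word in enumerate(sentence.split()):
            records.append({'Word': word, 'Ent': f"{tags[pos]}_{column}"})

out_loop = pd.DataFrame(records)
print(out_loop)
```

A cell with a fourth word would raise a `KeyError` here. If the last tag should simply repeat, `tags.get(pos, 'L')` would do that, but whether that matches the intended labeling scheme is an assumption.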
QUESTION
I have a dataframe with a column of strings separated by semicolons, each followed by a space. Unfortunately, some strings contain a semicolon that is not followed by a space.
This is what I'd like to do: if there is a space after the semicolon, no change is needed. However, if there are letters directly before and after the semicolon, the semicolon should be replaced with a space.
I have this:
datacolumn1
row 1 knowledge; information; data
row 2 digital;transmission; interoperability; data labeling
row 3 library catalogs; libraries; mobile;libraries
I need this output:
datacolumn1
row 1 knowledge; information; data
row 2 digital transmission; interoperability; data labeling
row 3 library catalogs; libraries; mobile libraries
ANSWER
Answered 2020-Nov-16 at 07:24
Try something like:
library(stringr)
str_replace_all(datacolumn1, "(\\w);(\\w)", "\\1 \\2")
There is probably a neater regex out there, but this will do!
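For readers working in pandas rather than R, a rough equivalent might look like the sketch below. It swaps the capture groups for lookarounds, which is a different technique from the answer's regex: the word characters on either side are checked but not consumed, so back-to-back cases such as `a;b;c` are also handled. The sample frame is rebuilt from the question.

```python
import pandas as pd

df = pd.DataFrame({'datacolumn1': [
    'knowledge; information; data',
    'digital;transmission; interoperability; data labeling',
    'library catalogs; libraries; mobile;libraries',
]})

# A semicolon with a word character directly before and after it
pattern = r'(?<=\w);(?=\w)'

# Replace only those semicolons with a space; "; " separators are untouched
df['datacolumn1'] = df['datacolumn1'].str.replace(pattern, ' ', regex=True)
print(df)
```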
QUESTION
Objective: Generate a down-sampled FileDataset using random sampling from a larger FileDataset to be used in a Data Labeling project.
Details: I have a large FileDataset containing millions of images. Each filename contains details about the 'section' it was taken from. A section may contain thousands of images. I want to randomly select a specific number of sections and all the images associated with those sections. Then register the sample as a new dataset.
Please note that the code below is not a direct copy and paste as there are elements such as filepaths and variables that have been renamed for confidentiality reasons.
import azureml.core
from azureml.core import Dataset, Datastore, Workspace
# Load in work space from saved config file
ws = Workspace.from_config()
# Define full dataset of interest and retrieve it
dataset_name = 'complete_2017'
data = Dataset.get_by_name(ws, dataset_name)
# Extract file references from dataset as relative paths
rel_filepaths = data.to_path()
# Stitch back in base directory path to get a list of absolute paths
src_folder = '/raw-data/2017'
abs_filepaths = [src_folder + path for path in rel_filepaths]
# Define regular expression pattern for extracting source section
import random
import re
pattern = re.compile(r'/(S.+)_image\d+\.jpg')
# Create new list of all unique source sections
sections = sorted(set([m.group(1) for m in map(pattern.match, rel_filepaths) if m]))
# Randomly select sections
num_sections = 5
set_seed = 221020
random.seed(set_seed) # for repeatability
sample_sections = random.choices(sections, k = num_sections)
# Extract images related to the selected sections
matching_images = [filename for filename in abs_filepaths if any(section in filename for section in sample_sections)]
# Define datastore of interest
datastore = Datastore.get(ws, 'ml-datastore')
# Convert string paths to Azure Datapath objects and relate back to datastore
from azureml.data.datapath import DataPath
datastore_path = [DataPath(datastore, filepath) for filepath in matching_images]
# Generate new dataset using from_files() and filtered list of paths
sample = Dataset.File.from_files(datastore_path)
sample_name = 'random-section-sample'
sample_dataset = sample.register(workspace = ws, name = sample_name, description = 'Sampled sections from full dataset using set seed.')
Issue: The code I've written in Python SDK runs and the new FileDataset registers, but when I try to look at the dataset details or use it for a Data Labeling project I get the following error even as Owner.
Access denied: Failed to authenticate data access with Workspace system assigned identity. Make sure to add the identity as Reader of the data service.
Additionally, under the details tab Files in dataset is Unknown and Total size of files in dataset is Unavailable.
I haven't come across this issue anywhere else. I'm able to generate datasets in other ways, so I suspect it's an issue with the code given that I'm working with the data in an unconventional way.
Additional Notes:
- Azure ML version is 1.15.0
ANSWER
Answered 2020-Oct-27 at 22:39
Is the data behind a virtual network by any chance?
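Separately from the access error, the section-sampling step in the question can be sketched in isolation with plain Python. The paths below are invented stand-ins for the output of `data.to_path()`, and `section_of` and `rng` are illustrative names. The sketch does not touch the Azure ML SDK and is not a fix for the authentication problem.

```python
import random
import re

# Hypothetical relative paths standing in for data.to_path()
rel_filepaths = [
    '/S001_image1.jpg', '/S001_image2.jpg',
    '/S002_image1.jpg',
    '/S003_image1.jpg', '/S003_image2.jpg',
    '/S010_image1.jpg',
]

pattern = re.compile(r'/(S.+)_image\d+\.jpg$')

# Record the section of every path that matches the naming scheme
section_of = {}
for path in rel_filepaths:
    m = pattern.match(path)
    if m:
        section_of[path] = m.group(1)

sections = sorted(set(section_of.values()))

# Seeded generator for repeatability; sample() draws without replacement
rng = random.Random(221020)
sample_sections = set(rng.sample(sections, k=2))

# Keep every image whose section was selected (exact match, not substring)
matching_images = [p for p, s in section_of.items() if s in sample_sections]
print(sorted(sample_sections), matching_images)
```

Two choices here differ from the question's code. `random.sample` never returns the same section twice, whereas `random.choices` samples with replacement and can yield fewer distinct sections than requested. Comparing the extracted section name exactly also avoids the substring test picking up unrelated files, for example a section `S1` matching a file from `S10`.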
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported