Explore all Data Preparation open source software, libraries, packages, source code, cloud functions and APIs.


Popular New Releases in Data Preparation

gcoord: v0.3.2
mltrace: Integrating Test functionality
pyclipper: 1.2.1
mapbox-vector-tile
Leaflet.geoCSV: Stable

Popular Libraries in Data Preparation

topojson
by topojson (JavaScript), 4126 stars, NOASSERTION license
An extension of GeoJSON that encodes topology!

gcoord
by hujiulong (TypeScript), 2334 stars, MIT license
A tool for converting between geographic coordinate systems

carmen
by carmen-ruby (Ruby), 1132 stars, NOASSERTION license
A repository of geographic regions for Ruby

world-atlas
by topojson (Shell), 825 stars, ISC license
Pre-built TopoJSON from Natural Earth.

geojson
by jazzband (Python), 677 stars, BSD-3-Clause license
Python bindings and utilities for GeoJSON

shapefile-js
by calvinmetcalf (JavaScript), 513 stars
Convert a Shapefile to GeoJSON. Not many caveats.

tilelive
by mapbox (JavaScript), 480 stars, BSD-3-Clause license
fast interface to tiles with pluggable backends - NOT ACTIVELY MAINTAINED

mltrace
by loglabs (Python), 369 stars, Apache-2.0 license
Coarse-grained lineage and tracing for machine learning pipelines.

arc.js
by springmeyer (JavaScript), 317 stars, BSD-2-Clause license
great circle routes in javascript

Trending New libraries in Data Preparation

mltrace
by loglabs (Python), 369 stars, Apache-2.0 license
Coarse-grained lineage and tracing for machine learning pipelines.

GeoMapData_CN
by lyhmyd1211 (Python), 269 stars
China province/city/country geoJSON data

krsh
by riiid (Python), 79 stars, Apache-2.0 license
A declarative KubeFlow Management Tool

VectorTileExporter
by LilithWittmann (Python), 69 stars, MIT license
Just some scripts to export vector tiles to geojson.

osm-geo-mapper
by edouardpoitras (Rust), 33 stars, MIT license
Navigate OpenStreetMap data in the terminal

check-geojson
by placemark (TypeScript), 31 stars, MIT license
a checker for the geojson format. goes beyond a schema, checking semantics and producing character-level warnings.

codequality
by gitlab-org/ci-cd (HTML), 24 stars
Codequality jobs in pipelines https://docs.gitlab.com/ee/user/project/merge_requests/code_quality.html

country-levels
by hyperknot (Python), 20 stars, NOASSERTION license
Full planet GeoJSON extracts, based on ISO and FIPS codes.

lookups
by 4lie (Go), 11 stars, MIT license
Fast and in memory geo lookup library

Top Authors in Data Preparation

1. mapbox: 18 Libraries, 1501 stars
2. node-geojson: 4 Libraries, 150 stars
3. brandonxiang: 4 Libraries, 108 stars
4. perliedman: 4 Libraries, 214 stars
5. topojson: 3 Libraries, 5092 stars
6. ropensci-archive: 3 Libraries, 12 stars
7. nypl-spacetime: 2 Libraries, 5 stars
8. murphy214: 2 Libraries, 8 stars
9. openfusion-dev: 2 Libraries, 10 stars
10. turf-junkyard: 2 Libraries, 4 stars

Trending Kits in Data Preparation

No Trending Kits are available at this moment for Data Preparation

Trending Discussions on Data Preparation

Prepare for Binary Masks used for the image segmentation

Yolov5 object detection training

scatter plot color bar does not look right

using DelimitedFiles with title of the header

Pytorch : different behaviours in GAN training with different, but conceptually equivalent, code

Is there a way to query a csv file in Karate?

Multi Processing with sqlalchemy

Trouble changing imputer strategy in scikit-learn pipeline

Does tensorflow re-initialize weights when training in a for loop?

How to force Pytest to execute the only function in parametrize?

QUESTION

Prepare for Binary Masks used for the image segmentation

Asked 2022-Mar-30 at 08:33

I am trying to prepare the masks for image segmentation with Pytorch. I have three questions about data preparation.

  1. What is the appropriate data format to save the binary mask in general? PNG? JPEG?

  2. Does the mask size need to be square, such as (224x224), rather than a rectangle such as (224x448)?

  3. Are the mask values preserved when the size is converted from rectangle to square?

For example, the original mask image size is (600x900), which is binary [0,1]. However, when I applied

import torchvision.transforms as transforms

transforms.Compose([
    transforms.Resize((300, 300)),
    transforms.ToTensor(),
])

to the mask, the output contained values other than 0 and 1, such as 0.01, 0.0156, 0.22..., because the mask was resized.

I applied the code below to convert the mask back to binary: if the value is at most 0.3 it becomes 0, otherwise 1.

def __getitem__(self, idx):
    img, mask = self.load_data(idx)
    if self.img_transforms is not None:
        img = self.img_transforms(img)
    if self.mask_transforms is not None:
        mask = self.mask_transforms(mask)
        mask = torch.where(mask <= 0.3, 0, 1)
    return img, mask

but I wonder whether this process is a common approach and whether it is efficient.

ANSWER

Answered 2022-Mar-30 at 08:33
  1. PNG, because it is lossless by design.
  2. It depends. It is more convenient to use a standard square resolution such as (224x224); I would start with that.
  3. Use resize without interpolation: transforms.Resize((300, 300), interpolation=InterpolationMode.NEAREST) (see the sketch below).
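
A minimal sketch of the mask transforms along these lines, assuming torchvision is available (the 300x300 size and the 0.3 threshold are taken from the question; with nearest-neighbor resizing the mask stays binary, so the threshold is only a safety net):

import torch
import torchvision.transforms as transforms
from torchvision.transforms import InterpolationMode

mask_transforms = transforms.Compose([
    # nearest-neighbor resizing only copies existing pixel values, so a {0, 1} mask stays {0, 1}
    transforms.Resize((300, 300), interpolation=InterpolationMode.NEAREST),
    transforms.ToTensor(),
])

def binarize(mask, threshold=0.3):
    # optional safety net, e.g. if the stored mask has anti-aliased edges
    return torch.where(mask <= threshold, 0, 1)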

Source https://stackoverflow.com/questions/71669952

QUESTION

Yolov5 object detection training

Asked 2022-Mar-25 at 04:06

Please, I need your help concerning my YOLOv5 training process for object detection!

I am trying to train my YOLOv5 object detection model to detect a small object (a scratch). For labelling my images I used Roboflow, where I applied some data augmentation and some of the pre-processing that Roboflow offers as a service. When I finished the pre-processing and data augmentation steps, Roboflow offered a choice of output formats; in my case it is YOLOv5 PyTorch, and Roboflow does everything for me, splitting the data into training, validation and test sets. Hence, everything was set up as it should be for my data preparation, and at the end I got the folder with data.yaml and the images with their labels. In data.yaml I put the paths of my training and validation sets, as I saw in the GitHub tutorial for YOLOv5. I followed the steps very carefully, though.

The problem is that when the training starts I get nan in the obj and box columns, as you can see in the picture below, and I don't know why. Can someone relate to that or give me any clue to find the solution, please? It's my first project in computer vision.

This is what I get when the training process starts

This is the last error message when the training finishes

I think the problem maybe comes from here, but I don't know how to fix it; I used the YOLOv5 team's code as it is in the tutorial

The training continues without any problem, but the mAP and precision remain 0 for the whole process!

PS: Here is the link to the tutorial I followed: https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data

ANSWER

Answered 2021-Dec-04 at 09:38

Running my code in Colab worked successfully and the results were good. I think the problem was in my personal laptop environment, maybe the version of PyTorch I was using ('1.10.0+cu113'), or something else. If you have any advice on setting up my environment properly for YOLOv5, I would be happy to take it. Many thanks again to @alexheat.

Source https://stackoverflow.com/questions/70199621

QUESTION

scatter plot color bar does not look right

Asked 2022-Mar-24 at 22:20

I have written my code to create a scatter plot with a color bar on the right. But the color bar does not look right, in the sense that its colors are too light to be matched to the actual colors used in the plot. I am not sure what is missing or wrong here, but I am hoping to get something similar to what's shown here: https://medium.com/@juliansteam/what-bert-topic-modelling-reveal-about-the-2021-unrest-in-south-africa-d0d15629a9b4 (about the middle of the page)

df = ....  # data loading
df["topic"] = topics

# Plot parameters
top_n = topn
fontsize = 15
# some data preparation
to_plot = df.copy()
to_plot[df.topic >= top_n] = -1
outliers = to_plot.loc[to_plot.topic == -1]
non_outliers = to_plot.loc[to_plot.topic != -1]

# the actual plot
fig, ax = plt.subplots(figsize=(15, 15))
scatter_outliers = ax.scatter(outliers['x'], outliers['y'], color="#E0E0E0", s=1, alpha=.3)
scatter = ax.scatter(non_outliers['x'], non_outliers['y'], c=non_outliers['topic'], s=1, alpha=.3, cmap='hsv_r')
ax.text(0.99, 0.01, f"BERTopic - Top {top_n} topics", transform=ax.transAxes, horizontalalignment="right", color="black")
plt.xticks([], [])
plt.yticks([], [])
plt.colorbar(scatter)
plt.savefig(outfile+"_1.png", format='png', dpi=300)
plt.clf()
plt.close()

As you can see, an example plot looks like this. The color bar is created, but compared to the one shown in the link above, the colors are very light and do not seem to map to those on the scatter plot. Any suggestions?

ANSWER

Answered 2022-Mar-24 at 22:20

The colorbar uses the given alpha=.3. In the scatterplot, many dots with the same color are superimposed, causing them to look brighter than a single dot.

One way to tackle this is to create a ScalarMappable object to be used by the colorbar, taking the colormap and the norm of the scatter plot (but not its alpha). Note that simply changing the alpha of the scatter object (scatter.set_alpha(1)) would also change the plot itself.

import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
import numpy as np

x = np.random.normal(np.repeat(np.random.uniform(0, 20, 10), 1000))
y = np.random.normal(np.repeat(np.random.uniform(0, 10, 10), 1000))
c = np.repeat(np.arange(10), 1000)

scatter = plt.scatter(x, y, c=c, cmap='hsv_r', alpha=.3, s=3)
plt.colorbar(ScalarMappable(cmap=scatter.get_cmap(), norm=scatter.norm))
plt.tight_layout()
plt.show()

scatterplot with colorbar without alpha

Source https://stackoverflow.com/questions/71608280

QUESTION

using DelimitedFiles with title of the header

Asked 2022-Mar-09 at 19:26

When importing a .csv file, is there any way to read the data by the header title? Consider the following .csv file:

I mean, instead of start_node = round.(Int64, data[:,1]), is there another way to say that "start_node" is the column in the .csv file whose header is "start node i"?

# Importing packages
using DelimitedFiles

# Data Preparation
network_data_file = "network.csv"
network_data = readdlm(network_data_file, ',', header=true)
data = network_data[1]
header = network_data[2]

start_node = round.(Int64, data[:,1])
end_node = round.(Int64, data[:,2])

[screenshot of the CSV file]

ANSWER

Answered 2022-Mar-09 at 19:08

The most natural way is to use CSV along with the DataFrames package.

Consider file:

open("f.txt", "w") do f
    println(f, """start;end;c
           1;2;3
           4;5;6
           """)
end

You can do:

julia> using CSV, DataFrames

julia> df = CSV.File("f.txt") |> DataFrame
2Ɨ3 DataFrame
 Row │ start  end    c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3
   2 │     4      5      6

and now:

julia> df.start
2-element Vector{Int64}:
 1
 4

If you want to stick with DelimitedFiles you could do:

julia> res = readdlm("f.txt", ';'; header=true);

julia> DataFrame(res[1], vec(res[2]))
2Ɨ3 DataFrame
 Row │ start    end      c
     │ Float64  Float64  Float64
─────┼───────────────────────────
   1 │     1.0      2.0      3.0
   2 │     4.0      5.0      6.0

Going totally without any package:

julia> res[1][:, findfirst(==("start"), vec(res[2]))]
2-element Vector{Float64}:
 1.0
 4.0

Finally, it is worth noting that if you want readdlm to yield Ints instead of Float64s you could read data as readdlm("f.txt",';',Int,'\n'; header=true)

Source https://stackoverflow.com/questions/71414537

QUESTION

Pytorch : different behaviours in GAN training with different, but conceptually equivalent, code

Asked 2022-Feb-16 at 13:43

I'm trying to implement a simple GAN in Pytorch. The following training code works:

for epoch in range(max_epochs):  # loop over the dataset multiple times
    print(f'epoch: {epoch}')
    running_loss = 0.0

    for batch_idx, (data, _) in enumerate(data_gen_fn):

        # data preparation
        real_data        = data
        input_shape      = real_data.shape
        inputs_generator = torch.randn(*input_shape).detach()

        # generator forward
        fake_data = generator(inputs_generator).detach()
        # discriminator forward
        optimizer_generator.zero_grad()
        optimizer_discriminator.zero_grad()

        #################### ALERT CODE #######################
        predictions_on_real = discriminator(real_data)
        predictions_on_fake = discriminator(fake_data)

        predictions = torch.cat((predictions_on_real,
                                 predictions_on_fake), dim=0)
        #########################################################

        # loss discriminator
        labels_real_fake         = torch.tensor([1]*batch_size + [0]*batch_size)
        loss_discriminator_batch = criterion_discriminator(predictions,
                                                           labels_real_fake)
        # update discriminator
        loss_discriminator_batch.backward()
        optimizer_discriminator.step()

        # generator
        # zero the parameter gradients
        optimizer_discriminator.zero_grad()
        optimizer_generator.zero_grad()

        fake_data           = generator(inputs_generator)  # make again fake data but without detaching
        predictions_on_fake = discriminator(fake_data)      # D(G(encoding))

        # loss generator
        labels_fake          = torch.tensor([1]*batch_size)
        loss_generator_batch = criterion_generator(predictions_on_fake,
                                                   labels_fake)

        loss_generator_batch.backward()  # dL(D(G(encoding)))/dW_{G,D}
        optimizer_generator.step()

If I plot the generated images for each iteration, I see that the generated images look like the real ones, so the training procedure seems to work well.

However, if I try to change the code in the ALERT CODE part, i.e., instead of:

#################### ALERT CODE #######################
predictions_on_real = discriminator(real_data)
predictions_on_fake = discriminator(fake_data)

predictions = torch.cat((predictions_on_real,
                         predictions_on_fake), dim=0)
#########################################################

I use the following:

#################### ALERT CODE #######################
predictions = discriminator(torch.cat((real_data, fake_data), dim=0))
#######################################################

This is conceptually the same: in a nutshell, instead of doing two separate forward passes on the discriminator (the former on the real data, the latter on the fake data) and then concatenating the results, with the new code I first concatenate the real and fake data and then make just one forward pass on the concatenated data.

However, this code version does not work; the generated images always seem to be random noise.

Any explanation to this behavior?

ANSWER

Answered 2022-Feb-16 at 13:43
Why do we get different results?

Supplying inputs in either the same batch or in separate batches can make a difference if the model includes dependencies between different elements of the batch. By far the most common source in current deep learning models is batch normalization. As you mentioned, the discriminator does include batchnorm, so this is likely the reason for the different behaviors. Here is an example. Using single numbers and a batch size of 4:

import numpy as np

features = [1., 2., 5., 6.]
print("mean {}, std {}".format(np.mean(features), np.std(features)))

print("normalized features", (features - np.mean(features)) / np.std(features))

>>>mean 3.5, std 2.0615528128088303
>>>normalized features [-1.21267813 -0.72760688  0.72760688  1.21267813]

Now we split the batch into two parts. First part:

features = [1., 2.]
print("mean {}, std {}".format(np.mean(features), np.std(features)))

print("normalized features", (features - np.mean(features)) / np.std(features))

>>>mean 1.5, std 0.5
>>>normalized features [-1.  1.]

Second part:

features = [5., 6.]
print("mean {}, std {}".format(np.mean(features), np.std(features)))

print("normalized features", (features - np.mean(features)) / np.std(features))

>>>mean 5.5, std 0.5
>>>normalized features [-1.  1.]

As we can see, in the split-batch version, the two batches are normalized to the exact same numbers, even though the inputs are very different. In the joint-batch version, on the other hand, the larger numbers are still larger than the smaller ones as they are normalized using the same statistics.
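
To see the same effect with an actual batch-norm layer rather than plain NumPy, here is a minimal standalone sketch (a bare nn.BatchNorm1d in training mode, not the asker's discriminator; the numbers mirror the example above):

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(1)  # training mode: normalizes with the current batch's own statistics
real = torch.tensor([[1.], [2.]])
fake = torch.tensor([[5.], [6.]])

joint = bn(torch.cat((real, fake), dim=0))      # one set of statistics for all four values
split = torch.cat((bn(real), bn(fake)), dim=0)  # each half normalized on its own

print(joint.flatten())  # roughly [-1.21, -0.73, 0.73, 1.21]
print(split.flatten())  # roughly [-1.00, 1.00, -1.00, 1.00]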

Why does this matter?

With deep learning, it's always hard to say, and especially with GANs and their complex training dynamics. A possible explanation is that, as we can see in the example above, the separate batches result in more similar features after normalization even if the original inputs are quite different. This may help early in training, as the generator tends to output "garbage" which has very different statistics from real data.

With a joint batch, these differing statistics make it easy for the discriminator to tell the real and generated data apart, and we end up in a situation where the discriminator "overpowers" the generator.

By using separate batches, however, the different normalizations result in the generated and real data to look more similar, which makes the task less trivial for the discriminator and allows the generator to learn.

Source https://stackoverflow.com/questions/71140227

QUESTION

Is there a way to query a csv file in Karate?

Asked 2022-Feb-02 at 03:20

I am looking for functionality similar to the Fillo Excel API, where we can do CRUD operations on an Excel file using query-like statements.

A select statement over a CSV file would be a great addition to the framework, providing more flexibility in a test-data-driven testing approach.

Sample scenario: a test case that needs data preparation that inserts multiple records into a database.

Instead of putting all test data in one row or one cell like this, and doing a string split before processing:

|TC-ID|FNAME               |LNAME               |
|TC-1 |FNAME1,FNAME2,FNAME3|LNAME1,LNAME2,LNAME3|
|TC-2 |FNAME4              |LNAME4              |

we could design our CSV file like the one below, if we had something like * def data = read('Select * from persons.csv where TC-ID=TC-1')

|TC-ID|FNAME |LNAME |
|TC-1 |FNAME1|LNAME1|
|TC-1 |FNAME2|LNAME2|
|TC-1 |FNAME3|LNAME3|
|TC-2 |FNAME4|LNAME4|

ANSWER

Answered 2022-Feb-02 at 03:20

There's no need. Karate can transform a CSV file into a JSON array in one line:

* def data = read('data.csv')

After that just use JsonPath or a "filter" operation to "query" for data (search the docs for more examples):

* def found = data.find(x => x['TC-ID'] === 'TC-1')
* def results = data.filter(x => x.FNAME.startsWith('A'))

Source https://stackoverflow.com/questions/70949738

QUESTION

Multi Processing with sqlalchemy

Asked 2022-Feb-01 at 22:50

I have a python script that handles data transactions through sqlalchemy using:

def save_update(args):
    session, engine = create_session(config["DATABASE"])

    try:
        instance = get_record(session)
        if instance is None:
            instance = create_record(session)
        else:
            instance = update_record(session, instance)

        sync_errors(session, instance)
        sync_expressions(session, instance)
        sync_part(session, instance)

        session.commit()
    except:
        session.rollback()
        write_error(config)
        raise
    finally:
        session.close()

On top of the data transactions, I also have some processing unrelated to the database: data preparation before I can do my data transactions. Those prerequisite tasks take some time, so I wanted to execute multiple instances of this full script (data preparation + data transactions with sqlalchemy) in parallel.

I am thus doing the following in a different script (simplified example here):

process1 = Thread(target=call_script, args=[["python", python_file_path,
     "-xml", xml_path,
     "-o", args.outputFolder,
     "-l", log_path]])

process2 = Thread(target=call_script, args=[["python", python_file_path,
     "-xml", xml_path,
     "-o", args.outputFolder,
     "-l", log_path]])

process1.start()
process2.start()
process1.join()
process2.join()

The target function call_script executes the first script mentioned above (data preparation + data transactions with sqlalchemy):

def call_script(args):
    status = subprocess.call(args, shell=True)
    print(status)

So, to summarize, I will for instance have 2 sub-threads plus the main one running. Each of those sub-threads executes sqlalchemy code in a separate process.

My question thus is: should I be taking any specific considerations regarding the multiprocessing side of my code with sqlalchemy? For me the answer is no, as this is multiprocessing and not multithreading, exclusively due to the fact that I use subprocess.call() to execute my code.

Now in reality, from time to time I have the feeling I hit database locks during execution. I am not sure whether this is related to my code or whether someone else is hitting the database while I am processing it as well, but I was expecting each subprocess to actually lock the database when starting its work, so that the other subprocesses would wait for the current session to close.

EDIT

I have replaced multithreading with multiprocessing for testing:

processes = [subprocess.Popen(cmd[0], shell=True) for cmd in commands]

I still have the same issue, on which I got more details: I see SQL Server showing the status "AWAITING COMMAND", and this only goes away when I kill the related Python process executing the command. I feel it appears when I am intensely parallelizing the subprocesses, but I am really not sure.

Thanks in advance for any support.

ANSWER

Answered 2022-Jan-31 at 06:48

This is an interesting situation. It seems that maybe you can sidestep some of the manual process/thread handling and utilize something like multiprocessing's Pool. I made an example based on some other data-initializing code I had. This delegates creating test data and inserting it for each of 10 "devices" to a pool of 3 processes. One caveat that seems necessary is to dispose of the engine before it is shared across fork(), i.e. before the Pool tasks are created; this is mentioned here: engine-disposal

from random import randint
from datetime import datetime
from multiprocessing import Pool

from sqlalchemy import (
    create_engine,
    Integer,
    DateTime,
    String,
)
from sqlalchemy.schema import (
    Column,
    MetaData,
    ForeignKey,
)
from sqlalchemy.orm import declarative_base, relationship, Session, backref

db_uri = 'postgresql+psycopg2://username:password@/database'

engine = create_engine(db_uri, echo=False)

metadata = MetaData()

Base = declarative_base(metadata=metadata)

class Event(Base):
    __tablename__ = "events"
    id = Column(Integer, primary_key=True, index=True)
    created_on = Column(DateTime, nullable=False, index=True)
    device_id = Column(Integer, ForeignKey('devices.id'), nullable=True)
    device = relationship('Device', backref=backref("events"))


class Device(Base):
    __tablename__ = "devices"
    id = Column(Integer, primary_key=True, autoincrement=True)
    name = Column(String(50))


def get_test_data(device_num):
    """ Generate a test device and its test events for the given device number. """
    device_dict = dict(name=f'device-{device_num}')
    event_dicts = []
    for day in range(1, 5):
        for hour in range(0, 24):
            for _ in range(0, randint(0, 50)):
                event_dicts.append({
                    "created_on": datetime(day=day, month=1, year=2022, hour=hour),
                })
    return (device_dict, event_dicts)


def create_test_data(device_num):
    """ Actually write the test data to the database. """
    device_dict, event_dicts = get_test_data(device_num)
    print(f"creating test data for {device_dict['name']}")

    with Session(engine) as session:
        device = Device(**device_dict)
        session.add(device)
        session.flush()
        events = [Event(**event_dict) for event_dict in event_dicts]
        event_count = len(events)
        device.events.extend(events)
        session.add_all(events)
        session.commit()
    return event_count


if __name__ == '__main__':

    metadata.create_all(engine)

    # Throw this away before fork.
    engine.dispose()

    # I have a 4-core processor, so I chose 3.
    with Pool(3) as p:
        print(p.map(create_test_data, range(0, 10)))

    # Accessing engine here should still work
    # but a new connection will be created.
    with Session(engine) as session:
        print(session.query(Event).count())
Output
creating test data for device-1
creating test data for device-0
creating test data for device-2
creating test data for device-3
creating test data for device-4
creating test data for device-5
creating test data for device-6
creating test data for device-7
creating test data for device-8
creating test data for device-9
[2511, 2247, 2436, 2106, 2244, 2464, 2358, 2512, 2267, 2451]
23596

Source https://stackoverflow.com/questions/70918448

QUESTION

Trouble changing imputer strategy in scikit-learn pipeline

Asked 2022-Jan-27 at 05:26

I am trying to use GridSearchCV to select the best imputer strategy but I am having trouble doing that.

First, I have a data preparation pipeline for numerical and categorical columns:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline, make_pipeline

num_pipe = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipe = make_pipeline(SimpleImputer(strategy='constant', fill_value='NA'),
                         OneHotEncoder(sparse=False, handle_unknown='ignore'))

preprocessing = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])

Next, I have created a pipeline to train a support vector machine model with feature selection.

from sklearn.feature_selection import SelectFromModel

model = Pipeline([
    ("preprocess", preprocessing),
    ("feature_select", SelectFromModel(RandomForestRegressor(random_state=42))),
    ("regressor", SVR(kernel='rbf', C=30000.0, gamma=0.3))
])

Now, I am trying to see which imputer strategy is best for imputing missing values in the numerical columns, using GridSearchCV.

grid = {"model.named_steps.preprocess.transformers[0][1].named_steps['simpleimputer'].strategy":
        ['mean', 'median', 'most_frequent']}
grid_search = GridSearchCV(model, param_grid=grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

This is where I am getting the error. The full pipeline looks like this:

Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ['longitude', 'latitude',
                                                   'housing_median_age',
                                                   'total_rooms',
                                                   'total_bedrooms',
                                                   'population', 'households',
                                                   'median_income']),
                                                 ('cat',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='NA',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse=False))]),
                                                  ['ocean_proximity'])])),
                ('feature_select',
                 SelectFromModel(estimator=RandomForestRegressor(random_state=42))),
                ('regressor', SVR(C=30000.0, gamma=0.3))])

Can anyone tell me what I need to change in the grid search to make it work?

ANSWER

Answered 2022-Jan-27 at 05:26

You specify the parameters via a dictionary that maps each estimator/transformer name and the parameter you want to change to the values you want to try. If you have a pipeline (or a pipeline of pipelines), the full parameter name is the names of all its parent steps joined with double underscores. So for your case it looks like this:

grid = {
    "preprocess__num__simpleimputer__strategy": ['median']  # or ['mean', 'median', 'most_frequent'] to search over all three
}

simpleimputer is simply the name that make_pipeline assigned automatically (the lowercased class name).
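As a quick illustration (my addition, not from the original answer), inspecting the steps of a make_pipeline result shows the auto-generated names:

from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Each step is named after its lowercased class name.
print(make_pipeline(SimpleImputer(), StandardScaler()).steps)
# [('simpleimputer', SimpleImputer()), ('standardscaler', StandardScaler())]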

However, I think there are other issues in your code. For example, fill_value='NA' is probably not what you want: it is not the value that marks missing entries, it is the value used to fill them in, so it is likely incorrect (or simply unnecessary) here.
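For completeness, here is a minimal sketch (not part of the original answer) that lists the valid parameter names and runs the corrected search; it assumes model, X_train and y_train are defined as in the question:

from sklearn.model_selection import GridSearchCV

# Every tunable parameter of the pipeline, as double-underscore paths;
# these are exactly the keys GridSearchCV accepts.
print(sorted(model.get_params().keys()))

grid = {"preprocess__num__simpleimputer__strategy": ['mean', 'median', 'most_frequent']}
grid_search = GridSearchCV(model, param_grid=grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)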

Source https://stackoverflow.com/questions/70873560

QUESTION

Does tensorflow re-initialize weights when training in a for loop?

Asked 2022-Jan-20 at 15:12

I'm training a model within a for loop, because... I can. I know there are alternatives like the tf.data.Dataset API with generators to stream data from disk, but my question is about the specific case of a loop.

Does TF re-initialize the weights of the model at the beginning of each loop iteration? Or does the initialization only occur the first time the model is instantiated?

EDIT:

import numpy as np
import pandas as pd
import tensorflow as tf

for msn in LIST:

    data = pd.read_parquet(
        "03 - Data",
        engine='pyarrow')
    data = data[column_order]
    data.rename(columns={"Flight_Id_Int": "Flight_Id"}, inplace=True)


    """ DATA PREPARATION AND FORMATTING """
    data_clean = clean_and_prepare(data, SEQ_LEN, input_type=model_type, smooth=True)

    # To keep the chronological order of flights we don't randomly shuffle
    train_idx = np.arange(0, int(len(data_clean)*0.9))
    test_idx = np.arange(int(len(data_clean)*0.9), len(data_clean))


    train_df = tf.data.Dataset.from_tensor_slices(
        (data_clean[train_idx], data_clean[train_idx])
        ).batch(BATCH_SIZE)

    test_df = tf.data.Dataset.from_tensor_slices(
        (data_clean[test_idx], data_clean[test_idx])
        ).batch(BATCH_SIZE)


    """ MODEL TRAINING """
    history = model.fit(train_df,
                epochs=EPOCHS,
                validation_data=(test_df),
                callbacks=[tf.keras.callbacks.EarlyStopping(
                    monitor="val_loss",
                    patience=15,
                    mode="min",
                    restore_best_weights=True)])

    plot_train_history(history, "Autoencoder {0} - MSN: {1}".format(model_type, msn))

ANSWER

Answered 2022-Jan-20 at 15:06

Weights are initialized when the layers are defined (i.e. before fit is ever called). TensorFlow does not re-initialize them afterward, even if you call fit multiple times.

To show this is the case, I plotted the decision boundary at regular training epochs (by calling fit and then predict):

[figure: decision boundary plotted at regular training epochs]
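If the goal is instead to start each msn from freshly initialized weights, one option is to rebuild the model inside the loop. Below is a minimal sketch (my addition, not from the original answer); build_model is a hypothetical factory, and LIST and SEQ_LEN stand in for the variables from the question:

import tensorflow as tf

LIST = ["msn-1", "msn-2"]   # placeholder for the question's LIST
SEQ_LEN = 32                # placeholder for the question's SEQ_LEN

def build_model(input_dim):
    # Hypothetical factory: any architecture works; the point is that each
    # call returns a model whose weights are freshly initialized.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(input_dim,)),
        tf.keras.layers.Dense(input_dim),
    ])
    model.compile(optimizer='adam', loss='mse')
    return model

for msn in LIST:
    model = build_model(input_dim=SEQ_LEN)  # new weights on every iteration
    # ... data preparation and model.fit(...) exactly as in the question ...

Alternatively, tf.keras.models.clone_model(model) returns a copy of the architecture with newly created weights (the original weights are not copied), which achieves the same effect without redefining the layers.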

Source https://stackoverflow.com/questions/70788383

QUESTION

How to force Pytest to execute the only function in parametrize?

Asked 2022-Jan-20 at 14:32

I have 2 tests. I want to run only one of them:

pipenv run pytest -s tmp_test.py::test_my_var

But pytest executes both functions used in @pytest.mark.parametrize (i.e. the ones for both tests).

How can I force pytest to execute only the get_my_var() function when I run only test_my_var?

If I run the whole file:

pipenv run pytest -s tmp_test.py

I want Pytest to execute the code in the following manner:

get_my_var()
test_my_var()
get_my_var_1()
test_my_var_1()

Actually, the functions in @pytest.mark.parametrize do some data preparation, and both tests use the same entities, so each parametrize function changes the state of the same test data.

That is why I need the parametrization functions to run sequentially, each one just before its corresponding test. Here is the test file:

import pytest


def get_my_var():
    with open('my var', 'w') as f:
        f.write('my var')
    return 'my var'


def get_my_var_1():
    with open('my var_1', 'w') as f:
        f.write('my var_1')
    return 'my var_1'


@pytest.mark.parametrize('my_var', get_my_var())
def test_my_var(my_var):
    pass


@pytest.mark.parametrize('my_var_1', get_my_var_1())
def test_my_var_1(my_var_1):
    pass

Or how can I achieve the same goal with any other approach?

For example, with fixtures. I could use a fixture for the data preparation, but I need to use the same fixture in different tests because the preparation is the same, so I cannot use scope='session'.

At the same time, scope='function' makes the fixture run for every instance of a parametrized test.

Is there a way to run a fixture (or any other function) only once per parametrized test, before all of its parametrized instances run?

ANSWER

Answered 2022-Jan-20 at 14:32

It looks like only something along the following lines can resolve the issue.

import pytest

current_test = None


@pytest.fixture()
def one_time_per_test_init(request):
    test_name = request.node.originalname
    global current_test
    if current_test != test_name:
        current_test = test_name
        init, kwargs = request.param
        init(**kwargs)
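A hypothetical usage sketch (my addition, not from the original answer) of how a test could wire this fixture up via indirect parametrization, with placeholder parameter values, so that get_my_var runs only once before the first instance of test_my_var:

# Stacked on the test from the question; the tuple becomes request.param
# inside one_time_per_test_init.
@pytest.mark.parametrize('one_time_per_test_init', [(get_my_var, {})], indirect=True)
@pytest.mark.parametrize('my_var', ['my var'])
def test_my_var(one_time_per_test_init, my_var):
    pass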

Source https://stackoverflow.com/questions/70721021

Community Discussions include sources from the Stack Exchange Network.

Tutorials and Learning Resources in Data Preparation

Tutorials and Learning Resources are not available at this moment for Data Preparation
