Popular New Releases in Data Preparation
gcoord
v0.3.2
mltrace
Integrating Test functionality
pyclipper
1.2.1
mapbox-vector-tile
Leaflet.geoCSV
Stable
Popular Libraries in Data Preparation
by topojson javascript
4126 NOASSERTION
An extension of GeoJSON that encodes topology!
by hujiulong typescript
2334 MIT
Geographic coordinate system conversion tool
by carmen-ruby ruby
1132 NOASSERTION
A repository of geographic regions for Ruby
by topojson shell
825 ISC
Pre-built TopoJSON from Natural Earth.
by jazzband python
677 BSD-3-Clause
Python bindings and utilities for GeoJSON
by calvinmetcalf javascript
513
Convert a Shapefile to GeoJSON. Not many caveats.
by mapbox javascript
480 BSD-3-Clause
fast interface to tiles with pluggable backends - NOT ACTIVELY MAINTAINED
by loglabs python
369 Apache-2.0
Coarse-grained lineage and tracing for machine learning pipelines.
by springmeyer javascript
317 BSD-2-Clause
great circle routes in javascript
Trending New libraries in Data Preparation
by loglabs python
369 Apache-2.0
Coarse-grained lineage and tracing for machine learning pipelines.
by lyhmyd1211 python
269
China province/city/country geoJSON data
by riiid python
79 Apache-2.0
A declarative KubeFlow Management Tool
by LilithWittmann python
69 MIT
Just some scripts to export vector tiles to geojson.
by edouardpoitras rust
33 MIT
Navigate OpenStreetMap data in the terminal
by placemark typescript
31 MIT
a checker for the geojson format. goes beyond a schema, checking semantics and producing character-level warnings.
by gitlab-org/ci-cd html
24
Codequality jobs in pipelines https://docs.gitlab.com/ee/user/project/merge_requests/code_quality.html
by hyperknot python
20 NOASSERTION
Full planet GeoJSON extracts, based on ISO and FIPS codes.
by 4lie go
11 MIT
Fast and in memory geo lookup library
Top Authors in Data Preparation
1: 18 Libraries, 1501 stars
2: 4 Libraries, 150 stars
3: 4 Libraries, 108 stars
4: 4 Libraries, 214 stars
5: 3 Libraries, 5092 stars
6: 3 Libraries, 12 stars
7: 2 Libraries, 5 stars
8: 2 Libraries, 8 stars
9: 2 Libraries, 10 stars
10: 2 Libraries, 4 stars
Trending Kits in Data Preparation
No Trending Kits are available at this moment for Data Preparation
Trending Discussions on Data Preparation
Prepare for Binary Masks used for the image segmentation
Yolov5 object detection training
scatter plot color bar does not look right
using DelimitedFiles with title of the header
Pytorch : different behaviours in GAN training with different, but conceptually equivalent, code
Is there a way to query a csv file in Karate?
Multi Processing with sqlalchemy
Trouble changing imputer strategy in scikit-learn pipeline
Does tensorflow re-initialize weights when training in a for loop?
How to force Pytest to execute the only function in parametrize?
QUESTION
Prepare for Binary Masks used for the image segmentation
Asked 2022-Mar-30 at 08:33
I am trying to prepare the masks for image segmentation with Pytorch. I have three questions about data preparation.
What is the appropriate data format to save the binary mask in general? PNG? JPEG?
Is the mask size needed to be set square such as (224x224), not a rectangle such as (224x448)?
Is the mask value fixed when the size is converted from rectangle to square?
For example, the original mask image size is (600x900), which is binary [0,1]. However, when I applied
import torchvision.transforms as transforms
transforms.Compose([
    transforms.Resize((300, 300)),
    transforms.ToTensor(),
])
to the mask, the output contained values other than 0 and 1 (for example 0.01, 0.0156, 0.22...), because resizing the mask interpolated its values.
I applied the code below to convert the mask back to binary: if a value is less than or equal to 0.3 it becomes 0, otherwise 1.
def __getitem__(self, idx):
    img, mask = self.load_data(idx)
    if self.img_transforms is not None:
        img = self.img_transforms(img)
    if self.mask_transforms is not None:
        mask = self.mask_transforms(mask)
    mask = torch.where(mask<=0.3, 0, 1)
    return img, mask
but I wonder whether this is a common and efficient approach.
ANSWER
Answered 2022-Mar-30 at 08:33
- PNG, because it is lossless by design.
- It depends. Using a standard square resolution such as (224x224) is more convenient; I would start with that.
- Resize the mask with nearest-neighbor interpolation so no new values are introduced:
transforms.Resize((300, 300), interpolation=InterpolationMode.NEAREST)
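For reference, here is a minimal sketch of how the mask pipeline could apply this. It is not part of the original answer; the (300, 300) size and the Compose structure come from the question, while the mask_transforms name is an assumption:
import torchvision.transforms as transforms
from torchvision.transforms import InterpolationMode

# Nearest-neighbor resizing only reuses values already present in the mask,
# so the resize step itself no longer introduces intermediate values.
mask_transforms = transforms.Compose([
    transforms.Resize((300, 300), interpolation=InterpolationMode.NEAREST),
    transforms.ToTensor(),
])
Keep in mind that ToTensor still rescales 8-bit PIL images by 1/255, so a mask stored as 0/255 comes out as 0/1, while a mask stored as 0/1 comes out as 0 and 1/255; a final torch.where threshold as in the question remains a cheap safeguard.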
QUESTION
Yolov5 object detection training
Asked 2022-Mar-25 at 04:06
Please, I need your help concerning my YOLOv5 training process for object detection!
I am trying to train my YOLOv5 object detection model to detect a small object (a scratch). For labelling my images I used Roboflow, where I applied some data augmentation and some pre-processing that Roboflow offers as a service. When I finished the pre-processing and data augmentation steps, Roboflow offered a choice of output formats, in my case YOLOv5 PyTorch, and Roboflow also split the data into training, validation and test sets. Everything was set up as it should be for my data preparation, and at the end I got the folder with data.yaml and the images with their labels. In data.yaml I put the paths of my training and validation sets, as shown in the GitHub tutorial for YOLOv5. I followed the steps very carefully, though.
The problem is that when the training starts I get nan in the obj and box columns, as you can see in the picture below, and I don't know why. Can someone relate to that or give me any clue to find the solution, please? It's my first project in computer vision.
This is what I get when the training process starts.
This is the last error message when the training finishes.
The training continues without any problem, but the mAP and precision remain 0 for the whole process!
PS: Here is the link to the tutorial I followed: https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data
ANSWER
Answered 2021-Dec-04 at 09:38
Running my code in Colab worked successfully and the results were good. I think the problem was in my personal laptop environment, maybe the version of PyTorch I was using ('1.10.0+cu113'), or something else. If you have any advice on how to set up my environment for YOLOv5 properly, I would be happy to hear it. Many thanks again to @alexheat.
QUESTION
scatter plot color bar does not look right
Asked 2022-Mar-24 at 22:20
I have written my code to create a scatter plot with a color bar on the right. But the color bar does not look right, in the sense that the color is too light to be mapped to the actual color used in the plot. I am not sure what is missing or wrong here. But I am hoping to get something similar to what's shown here: https://medium.com/@juliansteam/what-bert-topic-modelling-reveal-about-the-2021-unrest-in-south-africa-d0d15629a9b4 (about in the middle of the page)
df = .... # data loading
df["topic"] = topics

# Plot parameters
top_n = topn
fontsize = 15
# some data preparation
to_plot = df.copy()
to_plot[df.topic >= top_n] = -1
outliers = to_plot.loc[to_plot.topic == -1]
non_outliers = to_plot.loc[to_plot.topic != -1]

# the actual plot
fig, ax = plt.subplots(figsize=(15, 15))
scatter_outliers = ax.scatter(outliers['x'], outliers['y'], color="#E0E0E0", s=1, alpha=.3)
scatter = ax.scatter(non_outliers['x'], non_outliers['y'], c=non_outliers['topic'], s=1, alpha=.3, cmap='hsv_r')
ax.text(0.99, 0.01, f"BERTopic - Top {top_n} topics", transform=ax.transAxes, horizontalalignment="right", color="black")
plt.xticks([], [])
plt.yticks([], [])
plt.colorbar(scatter)
plt.savefig(outfile+"_1.png", format='png', dpi=300)
plt.clf()
plt.close()
As you can see, an example plot looks like this. The color bar is created, but compared to that shown in the link above, the color is very light and does not seem to map to those on the scatter plot. Any suggestions?
ANSWER
Answered 2022-Mar-24 at 22:20
The colorbar uses the given alpha=.3. In the scatter plot, many dots with the same color are superimposed, causing them to look brighter than a single dot.
One way to tackle this is to create a ScalarMappable object to be used by the colorbar, taking the colormap and the norm of the scatter plot (but not its alpha). Note that simply changing the alpha of the scatter object (scatter.set_alpha(1)) would also change the plot itself.
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
import numpy as np

x = np.random.normal(np.repeat(np.random.uniform(0, 20, 10), 1000))
y = np.random.normal(np.repeat(np.random.uniform(0, 10, 10), 1000))
c = np.repeat(np.arange(10), 1000)

scatter = plt.scatter(x, y, c=c, cmap='hsv_r', alpha=.3, s=3)
plt.colorbar(ScalarMappable(cmap=scatter.get_cmap(), norm=scatter.norm))
plt.tight_layout()
plt.show()
QUESTION
using DelimitedFiles with title of the header
Asked 2022-Mar-09 at 19:26
When importing a .csv file, is there any way to read the data by the title of the header? Consider the .csv file shown below.
I mean, instead of start_node = round.(Int64, data[:,1]), is there another way to say that "start_node" is the column in the .csv file whose header is "start node i"?
# Importing packages
using DelimitedFiles

# Data Preparation
network_data_file = "network.csv"
network_data = readdlm(network_data_file, ',', header=true)
data = network_data[1]
header = network_data[2]

start_node = round.(Int64, data[:,1])
end_node = round.(Int64, data[:,2])
ANSWER
Answered 2022-Mar-09 at 19:08
The most natural way is to use CSV along with the DataFrames package.
Consider the file:
open("f.txt", "w") do f
    println(f,"""start;end;c
    1;2;3
    4;5;6
    """)
end
You can do:
julia> df = CSV.File("f.txt") |> DataFrame
2×3 DataFrame
 Row │ start  end    c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3
   2 │     4      5      6
and now:
julia> df.start
2-element Vector{Int64}:
 1
 4
If you want to stick with DelimitedFiles, you could do:
julia> res = readdlm("f.txt",';'; header=true);

julia> DataFrame(res[1], vec(res[2]))
2×3 DataFrame
 Row │ start    end      c
     │ Float64  Float64  Float64
─────┼───────────────────────────
   1 │     1.0      2.0      3.0
   2 │     4.0      5.0      6.0
Going totally without any package:
julia> res[1][:,findfirst(==("start"), vec(res[2]))]
2-element Vector{Float64}:
 1.0
 4.0
Finally, it is worth noting that if you want readdlm to yield Ints instead of Float64s, you could read the data as readdlm("f.txt",';',Int,'\n'; header=true).
QUESTION
Pytorch : different behaviours in GAN training with different, but conceptually equivalent, code
Asked 2022-Feb-16 at 13:43
I'm trying to implement a simple GAN in Pytorch. The following training code works:
for epoch in range(max_epochs):  # loop over the dataset multiple times
    print(f'epoch: {epoch}')
    running_loss = 0.0

    for batch_idx, (data, _) in enumerate(data_gen_fn):

        # data preparation
        real_data = data
        input_shape = real_data.shape
        inputs_generator = torch.randn(*input_shape).detach()

        # generator forward
        fake_data = generator(inputs_generator).detach()
        # discriminator forward
        optimizer_generator.zero_grad()
        optimizer_discriminator.zero_grad()

        #################### ALERT CODE #######################
        predictions_on_real = discriminator(real_data)
        predictions_on_fake = discriminator(fake_data)

        predictions = torch.cat((predictions_on_real,
                                 predictions_on_fake), dim=0)
        #########################################################

        # loss discriminator
        labels_real_fake = torch.tensor([1]*batch_size + [0]*batch_size)
        loss_discriminator_batch = criterion_discriminator(predictions,
                                                           labels_real_fake)
        # update discriminator
        loss_discriminator_batch.backward()
        optimizer_discriminator.step()


        # generator
        # zero the parameter gradients
        optimizer_discriminator.zero_grad()
        optimizer_generator.zero_grad()

        fake_data = generator(inputs_generator)  # make again fake data but without detaching
        predictions_on_fake = discriminator(fake_data)  # D(G(encoding))

        # loss generator
        labels_fake = torch.tensor([1]*batch_size)
        loss_generator_batch = criterion_generator(predictions_on_fake,
                                                   labels_fake)

        loss_generator_batch.backward()  # dL(D(G(encoding)))/dW_{G,D}
        optimizer_generator.step()
If I plot the generated images for each iteration, I see that the generated images look like the real ones, so the training procedure seems to work well.
However, if I try to change the code in the ALERT CODE part, i.e., instead of:
#################### ALERT CODE #######################
predictions_on_real = discriminator(real_data)
predictions_on_fake = discriminator(fake_data)

predictions = torch.cat((predictions_on_real,
                         predictions_on_fake), dim=0)
#########################################################
I use the following:
#################### ALERT CODE #######################
predictions = discriminator(torch.cat((real_data, fake_data), dim=0))
#######################################################
That is conceptually the same: in a nutshell, instead of doing two different forward passes on the discriminator, the former on the real data and the latter on the fake data, and finally concatenating the results, with the new code I first concatenate the real and fake data and then make just one forward pass on the concatenated data.
However, this code version does not work, that is, the generated images seem to be always random noise.
Any explanation for this behavior?
ANSWER
Answered 2022-Feb-16 at 13:43
Supplying inputs in either the same batch, or separate batches, can make a difference if the model includes dependencies between different elements of the batch. By far the most common source in current deep learning models is batch normalization. As you mentioned, the discriminator does include batchnorm, so this is likely the reason for the different behaviors. Here is an example. Using single numbers and a batch size of 4:
features = [1., 2., 5., 6.]
print("mean {}, std {}".format(np.mean(features), np.std(features)))

print("normalized features", (features - np.mean(features)) / np.std(features))

>>>mean 3.5, std 2.0615528128088303
>>>normalized features [-1.21267813 -0.72760688  0.72760688  1.21267813]
Now we split the batch into two parts. First part:
features = [1., 2.]
print("mean {}, std {}".format(np.mean(features), np.std(features)))

print("normalized features", (features - np.mean(features)) / np.std(features))

>>>mean 1.5, std 0.5
>>>normalized features [-1.  1.]
Second part:
features = [5., 6.]
print("mean {}, std {}".format(np.mean(features), np.std(features)))

print("normalized features", (features - np.mean(features)) / np.std(features))

>>>mean 5.5, std 0.5
>>>normalized features [-1.  1.]
As we can see, in the split-batch version, the two batches are normalized to the exact same numbers, even though the inputs are very different. In the joint-batch version, on the other hand, the larger numbers are still larger than the smaller ones as they are normalized using the same statistics.
Why does this matter? With deep learning, it's always hard to say, and especially with GANs and their complex training dynamics. A possible explanation is that, as we can see in the example above, the separate batches result in more similar features after normalization even if the original inputs are quite different. This may help early in training, as the generator tends to output "garbage" which has very different statistics from real data.
With a joint batch, these differing statistics make it easy for the discriminator to tell the real and generated data apart, and we end up in a situation where the discriminator "overpowers" the generator.
By using separate batches, however, the different normalizations result in the generated and real data to look more similar, which makes the task less trivial for the discriminator and allows the generator to learn.
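The batchnorm effect described above can be reproduced directly in PyTorch. This is a minimal sketch and not part of the original answer; the two-sample "real" and "fake" batches are made up to mirror the numbers used above:
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(1, affine=False)  # pure normalization, no learnable scale/shift
bn.train()  # use batch statistics, as during GAN training

real = torch.tensor([[1.], [2.]])
fake = torch.tensor([[5.], [6.]])

# Joint batch: real and fake samples share one mean/std.
print(bn(torch.cat((real, fake), dim=0)).flatten())
# roughly tensor([-1.2127, -0.7276, 0.7276, 1.2127]) (up to eps)

# Separate batches: each half is normalized with its own statistics.
print(bn(real).flatten(), bn(fake).flatten())
# both come out as roughly tensor([-1., 1.])
In the joint batch the real values remain clearly separated from the fake ones after normalization, which is what makes the discriminator's job easier in the single-forward-pass version.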
QUESTION
Is there a way to query a csv file in Karate?
Asked 2022-Feb-02 at 03:20
I am looking for functionality similar to the Fillo Excel API, where we can do CRUD operations on an Excel file using query-like statements.
A select statement over a CSV file would be a great addition to the framework, providing more flexibility in data-driven testing.
Sample scenario: a test case that needs to prepare data by inserting multiple records into a database.
Instead of putting all test data in one row or one cell like this and doing a string split before processing:
|TC-ID|FNAME               |LNAME               |
|TC-1 |FNAME1,FNAME2,FNAME3|LNAME1,LNAME2,LNAME3|
|TC-2 |FNAME4              |LNAME4              |
We could design our csv file like the one below, so that we could write something like * def data = read('Select * from persons.csv where TC-ID=TC-1')
|TC-ID|FNAME |LNAME |
|TC-1 |FNAME1|LNAME1|
|TC-1 |FNAME2|LNAME2|
|TC-1 |FNAME3|LNAME3|
|TC-2 |FNAME4|LNAME4|
ANSWER
Answered 2022-Feb-02 at 03:20
There's no need. Karate can transform a CSV file into a JSON array in one line:
* def data = read('data.csv')
After that just use JsonPath or a "filter" operation to "query" for data (search the docs for more examples):
* def found = data.find(x => x['TC-ID'] === 'TC-1')
* def results = data.filter(x => x.FNAME.startsWith('A'))
QUESTION
Multi Processing with sqlalchemy
Asked 2022-Feb-01 at 22:50
I have a python script that handles data transactions through sqlalchemy using:
def save_update(args):
    session, engine = create_session(config["DATABASE"])

    try:
        instance = get_record(session)
        if instance is None:
            instance = create_record(session)
        else:
            instance = update_record(session, instance)

        sync_errors(session, instance)
        sync_expressions(session, instance)
        sync_part(session, instance)

        session.commit()
    except:
        session.rollback()
        write_error(config)
        raise
    finally:
        session.close()
On top of the data transactions, I also have some processing unrelated to the database: data preparation before I can do my data transaction. Those prerequisite tasks take some time, so I wanted to execute multiple instances of this full script (data preparation + data transactions with sqlalchemy) in parallel.
I am thus doing the following in a different script, simplified example here:
process1 = Thread(target=call_script, args=[["python", python_file_path,
                                             "-xml", xml_path,
                                             "-o", args.outputFolder,
                                             "-l", log_path]])

process2 = Thread(target=call_script, args=[["python", python_file_path,
                                             "-xml", xml_path,
                                             "-o", args.outputFolder,
                                             "-l", log_path]])

process1.start()
process2.start()
process1.join()
process2.join()
The target function "call_script" executes the first script mentioned above (data preparation + data transactions with sqlalchemy):
def call_script(args):
    status = subprocess.call(args, shell=True)
    print(status)
So, to summarize, I will for instance have 2 sub-threads plus the main one running, and each of those sub-threads executes the sqlalchemy code in a separate process.
My question thus is: should I take care of any specific considerations regarding the multiprocessing side of my code with sqlalchemy? For me the answer is no, as this is multiprocessing rather than multithreading, because I use subprocess.call() to execute my code.
Now, in reality, from time to time I feel I get database locks during execution. I am not sure whether this is related to my code or someone else is hitting the database while I am processing it, but I was expecting each subprocess to lock the database when starting its work, so that the other subprocesses would wait for the current session to close.
EDIT
I have replaced multithreading with multiprocessing for testing:
processes = [subprocess.Popen(cmd[0], shell=True) for cmd in commands]
I still have the same issue, on which I got more details: SQL Server shows the status "AWAITING COMMAND", and this only goes away when I kill the related python process executing the command. It seems to appear when I am intensely parallelizing the sub-processes, but I am really not sure.
Thanks in advance for any support.
ANSWER
Answered 2022-Jan-31 at 06:48
This is an interesting situation. It seems that maybe you can sidestep some of the manual process/thread handling and utilize something like multiprocessing's Pool. I made an example based on some other data-initializing code I had. This delegates creating test data and inserting it for each of 10 "devices" to a pool of 3 processes. One caveat that seems necessary is to dispose of the engine before it is shared across fork(), i.e. before the Pool tasks are created; this is mentioned here: engine-disposal
from random import randint
from datetime import datetime
from multiprocessing import Pool

from sqlalchemy import (
    create_engine,
    Integer,
    DateTime,
    String,
)
from sqlalchemy.schema import (
    Column,
    MetaData,
    ForeignKey,
)
from sqlalchemy.orm import declarative_base, relationship, Session, backref

db_uri = 'postgresql+psycopg2://username:password@/database'

engine = create_engine(db_uri, echo=False)

metadata = MetaData()

Base = declarative_base(metadata=metadata)

class Event(Base):
    __tablename__ = "events"
    id = Column(Integer, primary_key=True, index=True)
    created_on = Column(DateTime, nullable=False, index=True)
    device_id = Column(Integer, ForeignKey('devices.id'), nullable=True)
    device = relationship('Device', backref=backref("events"))


class Device(Base):
    __tablename__ = "devices"
    id = Column(Integer, primary_key=True, autoincrement=True)
    name = Column(String(50))


def get_test_data(device_num):
    """ Generate a test device and its test events for the given device number. """
    device_dict = dict(name=f'device-{device_num}')
    event_dicts = []
    for day in range(1, 5):
        for hour in range(0, 24):
            for _ in range(0, randint(0, 50)):
                event_dicts.append({
                    "created_on": datetime(day=day, month=1, year=2022, hour=hour),
                })
    return (device_dict, event_dicts)


def create_test_data(device_num):
    """ Actually write the test data to the database. """
    device_dict, event_dicts = get_test_data(device_num)
    print(f"creating test data for {device_dict['name']}")

    with Session(engine) as session:
        device = Device(**device_dict)
        session.add(device)
        session.flush()
        events = [Event(**event_dict) for event_dict in event_dicts]
        event_count = len(events)
        device.events.extend(events)
        session.add_all(events)
        session.commit()
    return event_count


if __name__ == '__main__':

    metadata.create_all(engine)

    # Throw this away before fork.
    engine.dispose()

    # I have a 4-core processor, so I chose 3.
    with Pool(3) as p:
        print(p.map(create_test_data, range(0, 10)))

    # Accessing engine here should still work
    # but a new connection will be created.
    with Session(engine) as session:
        print(session.query(Event).count())
creating test data for device-1
creating test data for device-0
creating test data for device-2
creating test data for device-3
creating test data for device-4
creating test data for device-5
creating test data for device-6
creating test data for device-7
creating test data for device-8
creating test data for device-9
[2511, 2247, 2436, 2106, 2244, 2464, 2358, 2512, 2267, 2451]
23596
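If the same idea were applied to the question's script, a hypothetical adaptation could replace the Thread + subprocess pair with a Pool over the inputs. This is only a sketch: save_update and its argument handling are assumed to work as in the question, and the paths below are placeholders:
from multiprocessing import Pool

OUTPUT_FOLDER = "out"   # placeholder
LOG_PATH = "run.log"    # placeholder

def run_one(xml_path):
    # Each worker calls save_update, which builds its own session/engine via
    # create_session, so no database connections are shared across fork().
    save_update({"xml": xml_path, "o": OUTPUT_FOLDER, "l": LOG_PATH})
    return xml_path

if __name__ == "__main__":
    xml_paths = ["first.xml", "second.xml"]  # placeholder inputs
    with Pool(2) as p:
        print(p.map(run_one, xml_paths))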
QUESTION
Trouble changing imputer strategy in scikit-learn pipeline
Asked 2022-Jan-27 at 05:26
I am trying to use GridSearchCV to select the best imputer strategy but I am having trouble doing that.
First, I have a data preparation pipeline for numerical and categorical columns:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline, make_pipeline

num_pipe = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipe = make_pipeline(SimpleImputer(strategy='constant', fill_value='NA'),
                         OneHotEncoder(sparse=False, handle_unknown='ignore'))

preprocessing = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])
Next, I have created a pipeline to train a support vector machine model with feature selection.
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

model = Pipeline([
    ("preprocess", preprocessing),
    ("feature_select", SelectFromModel(RandomForestRegressor(random_state=42))),
    ("regressor", SVR(kernel='rbf', C=30000.0, gamma=0.3))
])
Now, I am trying to see which imputer strategy is best for imputing missing values for numerical columns using a GridSearchCV
from sklearn.model_selection import GridSearchCV

grid = {"model.named_steps.preprocess.transformers[0][1].named_steps['simpleimputer'].strategy":
            ['mean','median','most_frequent']}
grid_search = GridSearchCV(model, param_grid=grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
This is where I am getting the error. The full pipeline looks like this -
Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ['longitude', 'latitude',
                                                   'housing_median_age',
                                                   'total_rooms',
                                                   'total_bedrooms',
                                                   'population', 'households',
                                                   'median_income']),
                                                 ('cat',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='NA',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse=False))]),
                                                  ['ocean_proximity'])])),
                ('feature_select',
                 SelectFromModel(estimator=RandomForestRegressor(random_state=42))),
                ('regressor', SVR(C=30000.0, gamma=0.3))])
Can anyone tell me what I need to change in the grid search to make it work?
ANSWER
Answered 2022-Jan-27 at 05:26
You specify the parameters via a dictionary that maps the name of the estimator/transformer, plus the name of the parameter you want to change, to the list of values you want to try. If you have a pipeline, or a pipeline of pipelines, the name is the names of all its parents combined with a double underscore. So for your case, it looks like:
grid = {
    "preprocess__num__simpleimputer__strategy": ['median']
}
simpleimputer is simply the name that was automatically assigned by make_pipeline.
However, I think there are other issues in your code, like fill_value='NA' being incorrect and actually not needed, as it is not the values to be filled but the value used to fill missing values.
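Putting it together, a minimal sketch of the corrected search (not from the original answer; it assumes model, X_train and y_train are defined as in the question):
from sklearn.model_selection import GridSearchCV

grid = {
    # pipeline step -> ColumnTransformer branch -> step name -> parameter
    "preprocess__num__simpleimputer__strategy": ["mean", "median", "most_frequent"],
}
grid_search = GridSearchCV(model, param_grid=grid, cv=5,
                           scoring="neg_mean_squared_error")
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)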
QUESTION
Does tensorflow re-initialize weights when training in a for loop?
Asked 2022-Jan-20 at 15:12
I'm training a model within a for loop, because... I can.
I know there are alternatives like the tf.Dataset API with generators to stream data from disk, but my question is about the specific case of a loop.
Does TF re-initialize the weights of the model at the beginning of each loop iteration? Or does the initialization only occur the first time the model is instantiated?
EDIT:
for msn in LIST:

    data = pd.read_parquet(
        "03 - Data",
        engine='pyarrow')
    data = data[column_order]
    data.rename(columns={"Flight_Id_Int":"Flight_Id"}, inplace=True)


    """ DATA PREPARATION AND FORMATING """
    data_clean = clean_and_prepare(data, SEQ_LEN, input_type=model_type, smooth=True)

    # To keep the chronological order of flights we don't random shuffle
    train_idx = np.arange(0, int(len(data_clean)*0.9))
    test_idx = np.arange(int(len(data_clean)*0.9), len(data_clean))


    train_df = tf.data.Dataset.from_tensor_slices(
        (data_clean[train_idx], data_clean[train_idx])
    ).batch(BATCH_SIZE)

    test_df = tf.data.Dataset.from_tensor_slices(
        (data_clean[test_idx], data_clean[test_idx])
    ).batch(BATCH_SIZE)


    """ MODEL TRAINING """
    history = model.fit(train_df,
                        epochs=EPOCHS,
                        validation_data=(test_df),
                        callbacks=[tf.keras.callbacks.EarlyStopping(
                            monitor="val_loss",
                            patience=15,
                            mode="min",
                            restore_best_weights=True)])

    plot_train_history(history, "Autoencorder {0} - MSN: {1}".format(model_type, msn))
ANSWER
Answered 2022-Jan-20 at 15:06
QUESTION
How to force Pytest to execute the only function in parametrize?
Asked 2022-Jan-20 at 14:32
I have 2 tests. I want to run only one of them:
pipenv run pytest -s tmp_test.py::test_my_var
But pytest executes the functions in @pytest.mark.parametrize for both tests.
How can I force pytest to execute only the get_my_var() function when I run only test_my_var?
If I run the whole file:
pipenv run pytest -s tmp_test.py
I want Pytest to execute the code in the following manner:
get_my_var()
test_my_var()
get_my_var_1()
test_my_var_1()
Actually, my functions in @pytest.mark.parametrize do some data preparation, and both tests use the same entities, so each function in @pytest.mark.parametrize changes the state of the same test data.
That's why I strongly need the parametrization functions to run sequentially, each just before its corresponding test.
def get_my_var():
    with open('my var', 'w') as f:
        f.write('my var')
    return 'my var'


def get_my_var_1():
    with open('my var_1', 'w') as f:
        f.write('my var_1')
    return 'my var_1'


@pytest.mark.parametrize('my_var', get_my_var())
def test_my_var(my_var):
    pass


@pytest.mark.parametrize('my_var_1', get_my_var_1())
def test_my_var_1(my_var_1):
    pass
Or how can I achieve the same goal with any other options?
For example, with fixtures. I could use fixtures for data preparation, but I need to use the same fixture in different tests because the preparation is the same, so I cannot use scope='session'.
At the same time, scope='function' results in the fixture running for every instance of a parameterized test.
Is there a way to run a fixture (or any other function) only once for a parameterized test, before all of its parameterized instances run?
ANSWER
Answered 2022-Jan-20 at 14:32
It looks like only something like the following can resolve the issue.
import pytest

current_test = None


@pytest.fixture()
def one_time_per_test_init(request):
    test_name = request.node.originalname
    global current_test
    if current_test != test_name:
        current_test = test_name
        init, kwargs = request.param
        init(**kwargs)
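A hypothetical usage sketch, not part of the original answer: the fixture receives an (init_function, kwargs) pair through indirect parametrization, and the global guard above makes the preparation run only once per test function even when the test itself is parameterized; prepare_my_var is an invented stand-in for the question's data-preparation code:
def prepare_my_var():
    # one-time data preparation for this test function
    with open('my var', 'w') as f:
        f.write('my var')

@pytest.mark.parametrize('one_time_per_test_init',
                         [(prepare_my_var, {})], indirect=True)
@pytest.mark.parametrize('my_var', ['my var'])
def test_my_var(one_time_per_test_init, my_var):
    pass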
Community Discussions contain sources that include Stack Exchange Network
Tutorials and Learning Resources in Data Preparation
Tutorials and Learning Resources are not available at this moment for Data Preparation