Tune | The Ultimate .NET Experiment | Game Engine library
kandi X-RAY | Tune Summary
The Ultimate .NET Experiment.
Tune Examples and Code Snippets
def make_csv_dataset_v2(
file_pattern,
batch_size,
column_names=None,
column_defaults=None,
label_name=None,
select_columns=None,
field_delim=",",
use_quote_delim=True,
na_value="",
header=True,
num_epochs=
def make_batched_features_dataset_v2(file_pattern,
batch_size,
features,
reader=None,
label_key=None,
def make_tf_record_dataset(file_pattern,
batch_size,
parser_fn=None,
num_epochs=None,
shuffle=True,
shuffle_buffer_
Community Discussions
Trending Discussions on Tune
QUESTION
I'm trying to make sure gcc vectorizes my loops. It turns out that by using -march=znver1 (or -march=native), gcc skips some loops even though they can be vectorized. Why does this happen?
In this code, the second loop, which multiplies each element by a scalar, is not vectorised:
...ANSWER
Answered 2022-Apr-10 at 02:47
The default -mtune=generic has -mprefer-vector-width=256, and -mavx2 doesn't change that.
znver1 implies -mprefer-vector-width=128, because that is the full native width of the hardware. An instruction using 32-byte YMM vectors decodes to at least 2 uops, more if it's a lane-crossing shuffle. For simple vertical SIMD like this, 32-byte vectors would be ok; the pipeline handles 2-uop instructions efficiently. (And I think it is 6 uops wide but only 5 instructions wide, so max front-end throughput isn't available using only 1-uop instructions.) But when vectorization would require shuffling, e.g. with arrays of different element widths, GCC code-gen can get messier with 256-bit or wider.
And vmovdqa ymm0, ymm1 mov-elimination only works on the low 128-bit half on Zen1. Also, normally using 256-bit vectors would imply one should use vzeroupper afterwards, to avoid performance problems on other CPUs (but not Zen1).
I don't know how Zen1 handles misaligned 32-byte loads/stores where each 16-byte half is aligned but in separate cache lines. If that performs well, GCC might want to consider increasing the znver1 -mprefer-vector-width to 256. But wider vectors mean more cleanup code if the size isn't known to be a multiple of the vector width.
Ideally GCC would be able to detect easy cases like this and use 256-bit vectors there. (Pure vertical, no mixing of element widths, constant size that's a multiple of 32 bytes.) At least on CPUs where that's fine: znver1, but not bdver2 for example, where 256-bit stores are always slow due to a CPU design bug.
You can see the result of this choice in the way it vectorizes your first loop, the memset-like loop, with a vmovdqu [rdx], xmm0. https://godbolt.org/z/E5Tq7Gfzc
So given that GCC has decided to only use 128-bit vectors, which can only hold two uint64_t elements, it (rightly or wrongly) decides it wouldn't be worth using vpsllq / vpaddq to implement qword *5 as (v<<2) + v, vs. doing it with integer in one LEA instruction.
Almost certainly wrongly in this case, since it still requires a separate load and store for every element or pair of elements. (And loop overhead, since GCC's default is not to unroll except with PGO, -fprofile-use. SIMD is like loop unrolling, especially on a CPU that handles 256-bit vectors as 2 separate uops.)
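The shift-and-add rewrite mentioned above can be checked with a small sketch. It is written in Python rather than C or asm, and the helper names and the 64-bit mask are illustrative assumptions used to mimic uint64_t wrap-around; it only demonstrates the arithmetic identity, not what GCC emits.

```python
MASK64 = (1 << 64) - 1  # emulate uint64_t wrap-around (assumption for this sketch)


def times5_shift_add(v: int) -> int:
    """Compute v * 5 as (v << 2) + v, truncated to 64 bits."""
    return ((v << 2) + v) & MASK64


def times5_multiply(v: int) -> int:
    """Reference result: a plain multiply truncated to 64 bits."""
    return (v * 5) & MASK64


# The two forms agree even when the result wraps around.
samples = [0, 1, 7, 2**62, 2**63, MASK64]
assert all(times5_shift_add(v) == times5_multiply(v) for v in samples)
```

Because 5 has only two set bits, one shift and one add are enough, which is why the vector form needs just a shift and an add per vector.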
I'm not sure exactly what GCC means by "not vectorized: unsupported data-type". x86 doesn't have a SIMD uint64_t multiply instruction until AVX-512, so perhaps GCC assigns it a cost based on the general case of having to emulate it with multiple 32x32 => 64-bit pmuludq instructions and a bunch of shuffles. And it's only after it gets over that hump that it realizes that it's actually quite cheap for a constant like 5 with only 2 set bits?
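To illustrate why the general case is costly, here is a sketch of how a 64x64 => 64-bit multiply can be assembled from 32x32 => 64-bit products, which is the per-lane operation pmuludq provides. The function names are hypothetical, the code models one lane with Python integers, and it says nothing about the shuffles a real vector implementation also needs.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def mul32x32(a: int, b: int) -> int:
    """One 32x32 => 64-bit multiply on the low halves of the inputs."""
    return (a & MASK32) * (b & MASK32)


def mul64_emulated(a: int, b: int) -> int:
    """Low 64 bits of a * b, built from three 32x32 => 64-bit products."""
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32
    # The a_hi * b_hi term only affects bits 64 and above, so it is dropped.
    cross = mul32x32(a_lo, b_hi) + mul32x32(a_hi, b_lo)
    return (mul32x32(a_lo, b_lo) + (cross << 32)) & MASK64


for a, b in [(5, 7), (2**40 + 3, 2**35 + 9), (MASK64, 5), (MASK64, MASK64)]:
    assert mul64_emulated(a, b) == (a * b) & MASK64
```

Three multiplies plus shifts and adds per element is far more work than the single shift and add that a constant like 5 needs.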
That would explain GCC's decision-making process here, but I'm not sure it's exactly the right explanation. Still, these kinds of factors are what happen in a complex piece of machinery like a compiler. A skilled human can easily make smarter choices, but compilers just do sequences of optimization passes that don't always consider the big picture and all the details at the same time.
-mprefer-vector-width=256 doesn't help: not vectorizing uint64_t *= 5 seems to be a GCC9 regression
(The benchmarks in the question confirm that an actual Zen1 CPU gets a nearly 2x speedup, as expected from doing 2x uint64 in 6 uops vs. 1x in 5 uops with scalar. Or 4x uint64_t in 10 uops with 256-bit vectors, including two 128-bit stores which will be the throughput bottleneck along with the front-end.)
Even with -march=znver1 -O3 -mprefer-vector-width=256, we don't get the *= 5 loop vectorized with GCC9, 10, or 11, or current trunk. As you say, we do with -march=znver2. https://godbolt.org/z/dMTh7Wxcq
We do get vectorization with those options for uint32_t (even leaving the vector width at 128-bit). Scalar would cost 4 operations per vector uop (not instruction), regardless of 128 or 256-bit vectorization on Zen1, so this doesn't tell us whether *= is what makes the cost-model decide not to vectorize, or just the 2 vs. 4 elements per 128-bit internal uop.
With uint64_t, changing to arr[i] += arr[i]<<2; still doesn't vectorize, but arr[i] <<= 1; does (https://godbolt.org/z/6PMn93Y5G). Even arr[i] <<= 2; and arr[i] += 123 in the same loop vectorize, to the same instructions that GCC thinks aren't worth it for vectorizing *= 5, just with different operands: a constant instead of the original vector again. (Scalar could still use one LEA.) So clearly the cost-model isn't looking as far as final x86 asm machine instructions, but I don't know why arr[i] += arr[i] would be considered more expensive than arr[i] <<= 1; which is exactly the same thing.
GCC8 does vectorize your loop, even with 128-bit vector width: https://godbolt.org/z/5o6qjc7f6
QUESTION
I am working on a spatial search problem for spheres in which I want to find connected spheres. To do so, I search around each sphere for spheres whose centers lie within a distance of one maximum sphere diameter from the searching sphere's center. At first I tried scipy-based methods, but the scipy method takes longer than the equivalent numpy method. For scipy, I first determine the number K of nearest spheres and then find them with cKDTree.query, which costs additional time. However, it is slower than the numpy method even when the first step is skipped by using a constant value (and skipping the first step is not a good idea in this case). This is contrary to my expectations about scipy's spatial search speed. So I tried replacing some numpy lines with list loops and speeding them up with numba prange. Numba runs the code a little faster, but I believe this code can be optimized further, perhaps by vectorization, by using alternative numpy modules, or by using numba in another way. I iterate over all spheres to prevent probable memory leaks and …, where the number of spheres is high.
ANSWER
Answered 2022-Feb-14 at 10:23
Have you tried FLANN?
This code doesn't solve your problem completely. It simply finds the nearest 50 neighbors to each point in your 500000 point dataset:
QUESTION
I'm correctly generating my image Yocto-hardknott-technexion with this:
...ANSWER
Answered 2022-Feb-16 at 16:34
The solution was to change imx7d-pico-pi-m4.dtb to imx7d-pico-pi-qca-m4.dtb in the Yocto/Hardknott/technexion configuration file called pico-imx7.conf (described in the post).
QUESTION
I am attempting to fine-tune a BERT model on Google Colab from the Tensorflow Hub using this link.
However, I run into the following error:
...ANSWER
Answered 2021-Dec-31 at 08:18
I don't know exactly what changes you have made in the code, and I have no idea about your dataset. But I can see that you are trying to train on the whole dataset in one epoch and passing the steps per epoch directly. I would recommend writing it like this:
Set batch_size to a power of 2 (for example 16 or 32); if you don't want to batch the dataset, just set batch_size to 1.
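The relationship between batch size and steps per epoch can be sketched in plain Python. The answer's original snippet is not reproduced on this page, so the function name and the dataset size of 1000 below are made-up assumptions for illustration only.

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of batches needed to see every example once (last batch may be partial)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return math.ceil(num_examples / batch_size)


# Hypothetical dataset of 1000 examples with a few candidate batch sizes.
for batch_size in (1, 16, 32):
    print(batch_size, steps_per_epoch(1000, batch_size))
```

With batch_size set to 1, every example is its own step, which matches the suggestion above for an unbatched dataset.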
QUESTION
Hi, I know this is probably a stupid question, but what does this error mean in relation to my app.js file? It didn't appear until I ran my local server. It was working fine prior.
ANSWER
Answered 2021-Dec-22 at 19:36
You don't need to have a inside your
as it already did for you
So remove the tag:
QUESTION
I am trying code from this page. I ran up to the part LR (tf-idf) and got similar results.
After that I decided to try GridSearchCV. My questions are below:
1)
...ANSWER
Answered 2021-Dec-09 at 23:12
You end up with the precision error because some of your penalization is too strong for this model. If you check the results, you get an F1 score of 0 when C = 0.001 and C = 0.01.
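A minimal sketch of why the score collapses: when a model is regularized so strongly that it never predicts the positive class, precision has a zero denominator and the F1 score falls to 0. The function below is a hand-rolled illustration in plain Python, not the scikit-learn implementation, and returning 0.0 for the undefined case is an assumption made for this sketch.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Binary precision, recall and F1, using 0.0 when a ratio is undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# A model that only ever predicts the negative class (toy labels).
y_true = [0, 0, 1, 1]
y_pred = [0, 0, 0, 0]
print(precision_recall_f1(y_true, y_pred))
```

Here there are no predicted positives at all, so both precision and F1 end up at 0 regardless of how many negatives were classified correctly.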
QUESTION
I'm trying to tune hyperparameters for KNN on a fairly small dataset (Kaggle Leaf, which has around 990 rows):
...ANSWER
Answered 2021-Dec-08 at 09:28
I'm not sure how you trained your model or how the preprocessing was done. The leaf dataset has about 100 labels (species), so you have to take care when splitting into train and test sets to ensure an even split of your samples. One reason for the weird accuracy could be that your samples are split unevenly.
Also you would need to scale your features:
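The answer's own snippet is not reproduced on this page. As a rough stand-in, the sketch below standardizes each feature column to zero mean and unit variance in plain Python; the function names and the toy rows are assumptions, and in practice a library scaler would normally be used instead. The statistics are computed on the training rows only and then reused for any other rows.

```python
from statistics import mean, pstdev


def fit_scaler(rows):
    """Return (mean, std) per column; a zero std is replaced by 1.0 to avoid dividing by zero."""
    columns = list(zip(*rows))
    return [(mean(col), pstdev(col) or 1.0) for col in columns]


def transform(rows, params):
    """Apply (x - mean) / std column by column using previously fitted parameters."""
    return [[(x - m) / s for x, (m, s) in zip(row, params)] for row in rows]


# Toy training rows whose two features live on very different scales.
train = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
params = fit_scaler(train)
scaled = transform(train, params)
print(scaled)
```

Scaling matters for KNN because distances are otherwise dominated by whichever feature has the largest numeric range.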
QUESTION
This question is the same as How can I check a confusion_matrix after fine-tuning with custom datasets?, on Data Science Stack Exchange.
Background
I would like to check a confusion_matrix, including precision, recall, and f1-score like below after fine-tuning with custom datasets.
Fine tuning process and the task are Sequence Classification with IMDb Reviews on the Fine-tuning with custom datasets tutorial on Hugging face.
After finishing the fine-tune with Trainer, how can I check a confusion_matrix in this case?
An image of confusion_matrix, including precision, recall, and f1-score original site: just for example output image
...ANSWER
Answered 2021-Nov-24 at 13:26
What you could do in this situation is to iterate over the validation set (or the test set, for that matter) and manually create a list of y_true and y_pred.
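The idea above can be sketched without any framework. In this illustration, predict_batch is a hypothetical stand-in for running the fine-tuned model on one batch, and the batches are toy data; once y_true and y_pred exist, any metrics routine can consume them.

```python
from collections import Counter


def collect_labels(batches, predict_batch):
    """Iterate over (features, labels) batches and gather true and predicted labels."""
    y_true, y_pred = [], []
    for features, labels in batches:
        y_true.extend(labels)
        y_pred.extend(predict_batch(features))
    return y_true, y_pred


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def predict_batch(features):
    """Hypothetical model: threshold a single score at 0.5."""
    return [1 if x >= 0.5 else 0 for x in features]


# Toy validation set split into two batches.
batches = [([0.9, 0.2, 0.8], [1, 0, 1]), ([0.7, 0.4], [0, 0])]
y_true, y_pred = collect_labels(batches, predict_batch)
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(matrix)
```

Precision, recall, and F1 per class can then be derived from the rows and columns of this matrix or computed by a metrics library from the same two lists.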
QUESTION
I've had a bit of a look around Stackoverflow and the wider Internet and identified that the most common causes for this error are conflation of declaration (int var = 1;) and definition (int var;), and including .c files from .h files.
My small project I just split from one file into several is not doing any of these things. I'm very confused.
I made a copy of the project and deleted all the code in the copy (which was fun) until I reached here:
main.c ...ANSWER
Answered 2021-Nov-10 at 21:14
Yes, there was a change in behaviour.
In C you are supposed to define a global variable in only one translation unit; other translation units that want to access the variable should declare it as "extern".
In your code, a.h is included in both a.c and main.c so the variable is defined twice. To fix this you should change the "int test" in a.h to "extern int test", then add "int test" to a.c to define the variable exactly once.
In C a definition of a global variable that does not initialise the variable is considered "tentative". You can have multiple tentative definitions of a variable in the same compilation unit. Multiple tentative definitions in different compilation units are not allowed in standard C, but were historically allowed by C compilers on unix systems.
Older versions of gcc would allow multiple tentative definitions (but not multiple non-tentative definitions) of a global variable in different compilation units by default. gcc-10 does not. You can restore the old behavior with the command line option "-fcommon" but this is discouraged.
QUESTION
I'm trying to fine-tune the ReformerModelWithLMHead (google/reformer-enwik8) for NER. I used the padding sequence length same as in the encode method (max_length = max([len(string) for string in list_of_strings])) along with attention_masks. And I got this error:
ValueError: If training, make sure that config.axial_pos_shape factors: (128, 512) multiply to sequence length. Got prod((128, 512)) != sequence_length: 2248. You might want to consider padding your sequence length to 65536 or changing config.axial_pos_shape.
- When I changed the sequence length to 65536, my colab session crashed by getting all the inputs of 65536 lengths.
- According to the second option(changing config.axial_pos_shape), I cannot change it.
I would like to know: is there any way to change config.axial_pos_shape while fine-tuning the model? Or am I missing something in encoding the input strings for reformer-enwik8?
Thanks!
Question Update: I have tried the following methods:
- By giving parameters at the time of model instantiation:
model = transformers.ReformerModelWithLMHead.from_pretrained("google/reformer-enwik8", num_labels=9, max_position_embeddings=1024, axial_pos_shape=[16,64], axial_pos_embds_dim=[32,96],hidden_size=128)
It gives me the following error:
RuntimeError: Error(s) in loading state_dict for ReformerModelWithLMHead: size mismatch for reformer.embeddings.word_embeddings.weight: copying a param with shape torch.Size([258, 1024]) from checkpoint, the shape in current model is torch.Size([258, 128]). size mismatch for reformer.embeddings.position_embeddings.weights.0: copying a param with shape torch.Size([128, 1, 256]) from checkpoint, the shape in current model is torch.Size([16, 1, 32]).
This is quite a long error.
- Then I tried this code to update the config:
Reshape Axial Position Embeddings layer to match desired max seq length ...model1 = transformers.ReformerModelWithLMHead.from_pretrained('google/reformer-enwik8', num_labels = 9)
ANSWER
Answered 2021-Aug-15 at 06:11
The Reformer model was proposed in the paper Reformer: The Efficient Transformer by Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. The paper describes a method for factorizing the gigantic matrix that results from working with very long sequences. This factorization relies on 2 assumptions:
- the parameter config.axial_pos_embds_dim is set to a tuple (d1, d2) whose sum has to be equal to config.hidden_size
- config.axial_pos_shape is set to a tuple (n1s, n2s) whose product has to be equal to config.max_embedding_size (more on these here!)
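The two constraints can be checked before loading a model. The helper below is a hypothetical plain-Python sketch (it is not part of the transformers library); the numbers in the usage lines are taken from the question, with the training sequence length standing in for the maximum embedding size.

```python
from math import prod


def check_axial_config(axial_pos_shape, axial_pos_embds_dim, hidden_size, sequence_length):
    """Return a list of human-readable problems; an empty list means both constraints hold."""
    problems = []
    if prod(axial_pos_shape) != sequence_length:
        problems.append(
            f"prod({tuple(axial_pos_shape)}) = {prod(axial_pos_shape)} "
            f"!= sequence length {sequence_length}"
        )
    if sum(axial_pos_embds_dim) != hidden_size:
        problems.append(
            f"sum({tuple(axial_pos_embds_dim)}) = {sum(axial_pos_embds_dim)} "
            f"!= hidden size {hidden_size}"
        )
    return problems


# Shape from the error message against the padded length the asker used.
print(check_axial_config((128, 512), (32, 96), 128, 2248))
# Values the asker tried at instantiation: both constraints are satisfied.
print(check_axial_config((16, 64), (32, 96), 128, 1024))
```

Passing this check only means the configuration is self-consistent; as the size-mismatch error in the question shows, it does not make a changed configuration compatible with the pretrained checkpoint's weights.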
Finally your question ;)
- I'm almost sure your session crashed due to RAM overflow
- you can change any config parameter during model instantiation, as shown in the official documentation!
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported