pre-training | Pre-Training Buys Better Robustness and Uncertainty Estimates | Cybersecurity library
kandi X-RAY | pre-training Summary
Pre-Training Buys Better Robustness and Uncertainty Estimates (ICML 2019)
Top functions reviewed by kandi - BETA
- Visualize performance comparison
- Compute the soft loss for a given confidence interval
- Calculate the ROC score
- Calculate the calibration error
- Compute the loss function
- Normalize x
- Download the file
- Check the integrity of the files
- Train the model using gradient descent
- Clamp x to a given radius
- Display calibration results
- Split a dataset into two datasets
- Train the model
- Tune the temperature distribution
- Run the training phase
- Test the loss function
- Create a validation folder
- Compute the C_hat matrix
- Train the network
- Visualize performance
pre-training Key Features
pre-training Examples and Code Snippets
Community Discussions
Trending Discussions on pre-training
QUESTION
I have a couple of questions regarding Gensim and its Word2Vec model.
The first is: what happens if I set it to train for 0 epochs? Does it just create the random vectors and call it done? So they have to be random every time, correct?
The second concerns the WV object; the doc page says:
...ANSWER
Answered 2021-May-20 at 18:08
I've not tried the nonsense parameter epochs=0, but it might behave as you expect. (Have you tried it and seen otherwise?)
However, if your real goal is to be able to tamper with the model after initialization but before training, the usual way to do that is to not supply any corpus when constructing the model instance, and instead manually do the two follow-up steps, .build_vocab() & .train(), in your own code, inserting extra steps between the two. (For even finer-grained control, you can examine the source of .build_vocab() & its helper methods, and simply ensure you do all those necessary things, with your own extra steps interleaved.)
The "word vectors" in the .wv
property of type KeyedVectors
are essentially the "input projection layer" of the model: the data which converts a single word into a vector_size
-dimensional dense embedding. (You can think of the keys – word token strings – as being somewhat like a one-hot word-encoding.)
So, assigning into that structure only changes that "input projection vector", which is the "word vector" usually collected from the model. If you need to tamper with the hidden-to-output weights, you need to look at the model's .syn1neg
(or .syn1
for HS mode) property.
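For illustration, here is a minimal sketch of that build-then-train flow in Gensim 4.x (the toy corpus and the zeroing-out step are placeholders, not anything from the question):

```python
from gensim.models import Word2Vec

sentences = [["hello", "world"], ["hello", "gensim"]]  # toy corpus (placeholder)

model = Word2Vec(vector_size=100, min_count=1)  # note: no corpus supplied yet
model.build_vocab(sentences)                    # step 1: build the vocabulary

# Tamper with the input projection layer (.wv) before any training happens,
# e.g. zero out the randomly initialized vector of one word:
idx = model.wv.key_to_index["hello"]
model.wv.vectors[idx] = 0.0

model.train(sentences, total_examples=model.corpus_count, epochs=5)  # step 2
```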
QUESTION
I looked through different implementations of BERT's Masked Language Model. For pre-training there are two common versions:
- Decoder would simply take the final embedding of the [MASK]ed token and pass it through a linear layer (without any modifications):
ANSWER
Answered 2021-Apr-12 at 21:12
For those who are interested, it is called weight tying or joint input-output embedding. There are two papers that argue for the benefit of this approach:
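A minimal PyTorch sketch of what weight tying means here (the dimensions are illustrative, not taken from any particular implementation):

```python
import torch.nn as nn

vocab_size, hidden_size = 30522, 768  # BERT-base-like sizes (illustrative)

embedding = nn.Embedding(vocab_size, hidden_size)         # input embedding
decoder = nn.Linear(hidden_size, vocab_size, bias=False)  # MLM output head

# Weight tying: the output projection shares the input embedding matrix,
# so one parameter receives gradients from both roles during training.
decoder.weight = embedding.weight
```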
QUESTION
I want to train some models to work with grayscale images, which e.g. is useful for microscope applications (Source). Therefore I want to train my model on grayscale imagenet, using the pytorch grayscale conversion (torchvision.transforms.Grayscale) to convert the RGB imagenet to a grayscale imagenet. Internally pytorch rotates the color space from RGB to YPbPr, computing the luma channel as Y' = 0.299 R + 0.587 G + 0.114 B.
Y' is the grayscale channel then, so that Pb and Pr can be neglected after transformation. Actually pytorch even only calculates Y', since the other two channels are discarded anyway.
...ANSWER
Answered 2021-Jan-14 at 11:09
Okay, I wasn't able to calculate the standard deviation as planned, but did it using the code below. The grayscale imagenet's train dataset mean and standard deviation are (round them as much as you like):
Mean: 0.44531356896770125
Standard Deviation: 0.2692461874154524
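The answerer's code was not preserved here, but a sketch along these lines computes the same statistics (the dataset path is a placeholder; accumulating sums avoids holding every pixel in memory):

```python
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Grayscale(),          # RGB -> single luma channel
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("imagenet/train", transform=transform)  # placeholder path
loader = torch.utils.data.DataLoader(dataset, batch_size=256, num_workers=8)

# Accumulate the sum and the sum of squares instead of storing all pixels.
count, total, total_sq = 0, 0.0, 0.0
for images, _ in loader:
    count += images.numel()
    total += images.sum().item()
    total_sq += (images ** 2).sum().item()

mean = total / count
std = (total_sq / count - mean ** 2) ** 0.5
print(mean, std)  # should land near the values quoted above
```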
QUESTION
I have been using the FUNSD dataset to predict sequence labeling in unstructured documents per this paper: LayoutLM: Pre-training of Text and Layout for Document Image Understanding. The data, after cleaning and moving from a dict to a dataframe, looks like this (screenshot omitted). The dataset is laid out as follows:
- The column id is the unique identifier for each word group inside a document, shown in column text (like Nodes)
- The column label identifies whether the word group is classified as a 'question' or an 'answer'
- The column linking denotes the WordGroups which are 'linked' (like Edges), linking corresponding 'questions' to 'answers'
- The column 'box' denotes the location coordinates (x,y top left; x,y bottom right) of the word group relative to the top left corner (0,0)
- The column 'words' holds each individual word inside the word group, and its location (box)
I aim to train a classifier to identify words inside the column 'words' that are linked together, using a Graph Neural Net, and the first step is to be able to transform my current dataset into a network. My questions are as follows:
- Is there a way to break each row in the column 'words' into two columns [box_word, text_word], each for one word, while replicating the other columns which remain the same: [id, label, text, box], resulting in a final dataframe with these columns: [box, text, label, box_word, text_word]?
- I can tokenize the columns 'text' and text_word, one-hot encode the column label, and split columns with more than one numeric value (box and box_word) into individual columns, but how do I split up/rearrange the column 'linking' to define the edges of my network graph?
- Am I taking the correct route in using the dataframe to generate a network, and using it to train a GNN?
Any and all help/tips are appreciated.
...ANSWER
Answered 2020-Oct-07 at 02:55
Edit: process multiple entries in the column words.
Your questions 1 and 2 are answered in the code. Actually quite simple (assuming the data format is correctly represented by what is shown in the screenshot). Digest:
Q1: apply the splitting function on the column and unpack by .tolist() such that separate columns can be created. See this post also.
Q2: Use list comprehension to unpack the extra list layer and retain only non-empty edges.
Q3: Yes and no. Yes, because pandas is good at organizing data with heterogeneous types. For example, lists, dicts, ints and floats can be present in different columns. Several I/O functions, such as pd.read_csv() or pd.read_json(), are also very handy.
However, there is overhead in data access, and that is especially costly when iterating over rows (records). Therefore, the transformed data that feeds directly into your model is usually converted into a numpy.array or a more efficient format. Such format conversion is the data scientist's own responsibility.
I made up my own sample dataset. Irrelevant columns are ignored (as I am not obliged to include them, and shouldn't).
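The answer's code was not carried over; here is a sketch of both steps on an invented two-row frame in the layout described above (all sample values are made up, not taken from FUNSD):

```python
import pandas as pd

# Invented sample mimicking the described layout.
df = pd.DataFrame({
    "id": [0, 1],
    "label": ["question", "answer"],
    "text": ["Name:", "John"],
    "box": [[0, 0, 50, 10], [60, 0, 110, 10]],
    "linking": [[[0, 1]], []],
    "words": [
        [{"text": "Name:", "box": [0, 0, 50, 10]}],
        [{"text": "John", "box": [60, 0, 110, 10]}],
    ],
})

# Q1: one row per word, replicating the word-group columns.
words = df.explode("words", ignore_index=True)
words["text_word"] = words["words"].apply(lambda w: w["text"])
words["box_word"] = words["words"].apply(lambda w: w["box"])
words = words[["id", "label", "text", "box", "text_word", "box_word"]]

# Q2: flatten 'linking' into an edge list, keeping only non-empty links.
edges = [tuple(link) for links in df["linking"] for link in links if link]
print(edges)  # [(0, 1)]
```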
QUESTION
I am reading the BERT model paper. In the Masked Language Model task during pre-training the BERT model, the paper says the model will choose 15% of tokens randomly. Among the chosen tokens (Ti), 80% will be replaced with the [MASK] token, 10% of the time Ti is unchanged, and 10% of the time Ti is replaced with another word. I think the model just needs to replace with [MASK] or another word; why does the model have to randomly choose a word and keep it unchanged? Does the pre-training process predict only the [MASK] tokens, or does it predict all 15% of the randomly chosen tokens?
...ANSWER
Answered 2020-Sep-22 at 16:51
This is done because they want to pre-train a bidirectional model. Most of the time the network will see a sentence with a [MASK] token, and it's trained to predict the word that is supposed to be there. But in fine-tuning, which is done after pre-training (fine-tuning is the training done by everyone who wants to use BERT on their task), there are no [MASK] tokens! (unless you specifically do masked LM).
This mismatch between pre-training and fine-tuning (the sudden disappearance of the [MASK] token) is softened by them: within the 15% of chosen tokens, 20% are not replaced by [MASK] (10% kept unchanged, 10% replaced by a random word). The task is still there, the network has to predict the token, but in the unchanged case it actually gets the answer already as input. This might seem counterintuitive but makes sense when combined with the [MASK] training.
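A short sketch of the 80/10/10 selection rule described above (the probabilities follow the BERT paper; the function itself is illustrative):

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", select_p=0.15):
    """Of the ~15% selected positions: 80% -> [MASK],
    10% -> a random word, 10% -> left unchanged.
    Every selected position becomes a prediction target."""
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < select_p:
            targets[i] = tok          # predicted regardless of surface form
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_token
            elif r < 0.9:
                inputs[i] = random.choice(vocab)
            # else: keep the original token unchanged
    return inputs, targets
```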
QUESTION
I am now moving into a natural language processing project. Before I get my hands dirty, I plan to read other people's work on the dataset, where it is organized as a leaderboard (see "Three-way Classification" section).
However, in order to download these papers, I need to manually click on each URL (there are about 50 of them), which is time-consuming. Therefore, I am trying to extract these URLs from the HTML, which looks like the following:
...ANSWER
Answered 2020-Sep-16 at 02:52
A regular expression along with the findall() method can be used for finding all the interesting links from the given HTML content.
BeautifulSoup offers an easy way to read a table from HTML.
The above goal of reading PDF links from a table inside the given HTML content can be achieved by using regex along with BeautifulSoup.
Working example using regex along with BeautifulSoup:
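The example code itself was not preserved on this page; a sketch of the approach (the URL is a placeholder):

```python
import re

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/leaderboard").text  # placeholder URL
soup = BeautifulSoup(html, "html.parser")

# Collect every anchor whose href points at a PDF.
pdf_links = [a["href"] for a in soup.find_all("a", href=re.compile(r"\.pdf$"))]
for link in pdf_links:
    print(link)
```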
QUESTION
The code:
...ANSWER
Answered 2020-Aug-29 at 11:24
You need to drop the double quotes around None:
QUESTION
I have set up a Returnn Transformer Model for NMT, which I want to train with an additional loss for every encoder/decoder attention head h on every decoder layer l (in addition to the vanilla Cross Entropy loss), i.e.:
ANSWER
Answered 2020-Aug-12 at 23:41
You are aware that the training is non-deterministic anyway, right? Did you try to rerun each case a couple of times? Also the baseline? Maybe the baseline itself is an outlier.
Also, changing the computation graph, even if this will be a no-op, can also have an effect. Unfortunately it can be sensitive.
You might want to try setting deterministic_train = True in your config. This might make it a bit more deterministic. Maybe you get the same result then in each of your cases. This might make it a bit slower, though.
The order of parameter initialization might be different as well. The order depends on the order of when the layers are created. Maybe compare that in the log. It is always the same random initializer, but would use a different seed offset then, so you would get another initialization.
You could play around by explicitly setting random_seed in the config, and see how much variance you get by that. Maybe all these values are within this range.
For a more in-depth debugging, you could really compare directly the computation graph (in TensorBoard). Maybe there is a difference which you did not notice. Also, maybe make a diff on the log output during net construction, for the case pretrain vs baseline. There should be no diff.
(As this might simply be a mistake, for now only as a side comment: of course, different RETURNN versions might have somewhat different behavior, so the version should be the same across your runs.)
Another note: You do not need this tf.reduce_sum in your loss. Actually that might not be such a good idea, because then the loss forgets about the number of frames and the number of sequences. If you just do not use tf.reduce_sum, it should also work, and then you get the correct normalization.
Another note: Instead of your lambda, you can also use loss_scale, which is simpler, and you get the original value in the log.
So basically, you could write it this way:
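(The answer's code block was not preserved; below is a rough sketch of the shape such a config fragment might take. The layer names are pure assumptions, not the asker's actual network.)

```python
# RETURNN config fragment (sketch; layer names are assumptions).
deterministic_train = True   # reduce run-to-run nondeterminism
random_seed = 42             # vary this to gauge run-to-run variance

network = {
    # ... encoder/decoder layers ...
    "att_loss": {
        "class": "copy", "from": "dec_att_weights",
        "loss": "as_is",     # use the layer output directly as the loss value
        "loss_scale": 0.5,   # replaces the manual lambda scaling
    },
}
```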
QUESTION
BERT pre-training of the base model is done with a language modeling approach, where we mask a certain percentage of tokens in a sentence, and make the model learn those missing tokens. Then, in order to do downstream tasks, we add a newly initialized layer and fine-tune the model.
However, suppose we have a gigantic dataset for sentence classification. Theoretically, can we initialize the BERT base architecture from scratch, train both the additional downstream-task-specific layer and the base model weights from scratch with this sentence classification dataset only, and still achieve a good result?
Thanks.
...ANSWER
Answered 2020-May-16 at 16:05
BERT can be viewed as a language encoder, which is trained on a humongous amount of data to learn the language well. As we know, the original BERT model was trained on the entire English Wikipedia and Book corpus, which sums to 3,300M words. BERT-base has 109M model parameters. So, if you think you have large enough data to train BERT, then the answer to your question is yes.
However, when you said "still achieve a good result", I assume you are comparing against the original BERT model. In that case, the answer lies in the size of the training data.
I am wondering why you prefer to train BERT from scratch instead of fine-tuning it. Is it because you are afraid of the domain adaptation issue? If not, pre-trained BERT is perhaps a better starting point.
Please note, if you want to train BERT from scratch, you may consider a smaller architecture. You may find the following papers useful.
QUESTION
I work in a legacy corporate setting where I only have a 16-core, 64GB VM to work with on an NLP project. I have a multi-label NLP text classification problem where I would really like to utilize a deep representation learning model like BERT, RoBERTa, ALBERT, etc.
I have approximately 200,000 documents that need to be labeled, and I have an annotated set of about 2,000 to use as the ground truth for training/testing/fine-tuning. I also have a much larger volume of domain-related documents to use for pre-training. I will most likely need to do the pre-training from scratch, since this is in a clinical domain. I am also open to pre-trained models if they might have a chance of working with just fine-tuning, like the Hugging Face ones, etc.
What PyTorch- or Keras-compatible models and implementations would folks suggest as a starting point? Or is this a computational non-starter with my existing compute resources?
...ANSWER
Answered 2020-May-14 at 21:05
If you want to use your current setup, it will have no problem running a transformer model. You can reduce memory use by reducing the batch size, but at the cost of slower runs.
Alternatively, test your algorithm on google Colab which is free. Then open a GCP account, google will provide $300 dollars of free credits. Use this to create a GPU cloud instance and then run your algorithm there.
You probably want to use Albert or Distilbert from HuggingFace Transformers. Albert and Distilbert are both compute and memory optimized. HuggingFace has lots of excellent examples.
As a rule of thumb, you want to avoid language model training from scratch. If possible, fine-tune the language model, or better yet skip it and go straight to training the classifier. Also, HuggingFace and others have MedicalBert, ScienceBert, and other specialized pretrained models.
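As a starting-point sketch with Hugging Face Transformers (the model name, label count, and sample text are placeholders for the clinical setup):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased"  # placeholder; swap in a clinical checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=5, problem_type="multi_label_classification")

batch = tokenizer(["sample clinical note"], truncation=True,
                  padding=True, return_tensors="pt")
labels = torch.tensor([[1., 0., 1., 0., 0.]])  # multi-hot label vector
loss = model(**batch, labels=labels).loss      # BCE-with-logits under the hood
loss.backward()                                # one step of a training loop
```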
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install pre-training
You can use pre-training like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.