fast_align | Simple, fast unsupervised word aligner | Natural Language Processing library
kandi X-RAY | fast_align Summary
Simple, fast unsupervised word aligner
Community Discussions
Trending Discussions on fast_align
QUESTION
I'm using the alignment toolkit fast_align: https://github.com/clab/fast_align, to get word-to-word alignment of a parallel corpus. There is an option to print out the alignment score -- how do I interpret this score? Does the score measure the degree of alignment between the parallel sentences? I know that some of the sentences in the corpus are well aligned and others are not, but so far I see no correlation between the score and how well aligned they are. Should I adjust for the number of words in the sentence?
...ANSWER
Answered 2019-Oct-09 at 08:31
FastAlign is an implementation of IBM Model 2; the score is the probability estimated by this model. The details of the model are explained very nicely in these slides from JHU.
The score is a probability of the source sentence given the target sentence words and the alignment. The algorithm iteratively estimates:
- The word-to-word translation probabilities for (virtually all) pairs of source-language and target-language words.
- The optimal alignment given those word-to-word translation probabilities.
The score is then the product of the word-to-word translation probabilities under the alignment the algorithm converged to. So, in theory, this should correlate with how parallel the sentences are, but there are many ways in which this can break. For instance, rare words have unreliable probability estimates. Another problem is that some words (such as "of") can be part of multi-word expressions that correspond to a single word in other languages, which skews the probability estimates as well. So it is no wonder that the probability is not to be trusted.
If your goal is to filter the parallel corpus and remove incorrectly aligned sentence pairs, I would recommend something else. You can, e.g., use Multilingual BERT as they did in a paper by Google, where they use centered vectors for cross-lingual retrieval. Or just google "parallel corpus filtering."
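On adjusting for sentence length: since the score is a product of per-word probabilities, it shrinks with every extra token, so raw scores from sentences of different lengths are not comparable. A common heuristic is to compare per-token log-probabilities instead. A minimal sketch, assuming hypothetical file names and that the dumped scores are log-probabilities (take math.log first if yours are raw probabilities):

```python
import math  # only needed if your scores are raw probabilities

# Hypothetical inputs: one fast_align score per line in scores.txt and the
# corresponding target sentences, one per line, in target.en.
with open("scores.txt") as f_scores, open("target.en") as f_sents:
    for score_line, sent in zip(f_scores, f_sents):
        logp = float(score_line)
        n_tokens = max(len(sent.split()), 1)
        # Per-token log-probability is comparable across sentence lengths.
        print(logp / n_tokens)
```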
QUESTION
How to docker build from a Dockerfile with more memory?
This is a different question from Allow more memory when docker build a Dockerfile.
When installing the software natively, there is enough memory to successfully build and install the marian tool. But when building the Docker image using the Dockerfile https://github.com/marian-nmt/marian/blob/master/scripts/docker/Dockerfile.cpu, it fails with multiple memory-exhausted errors.
...ANSWER
Answered 2019-Aug-29 at 19:01
It is not something about order. The Dockerfile must be specified with -f.
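A sketch of what that invocation looks like (the image tag is an example, not from the answer; note that docker build's --memory flag applies to the classic builder and is ignored by BuildKit, and on Docker Desktop it is usually the VM's memory allowance in the settings that actually needs raising):

```sh
# Point docker at the CPU Dockerfile explicitly with -f, building from the
# repository root so the build context is correct.
docker build -f scripts/docker/Dockerfile.cpu -t marian-cpu .

# With an explicit memory allowance for the build containers
# (classic builder only; 8g is an arbitrary example value).
docker build -f scripts/docker/Dockerfile.cpu --memory=8g -t marian-cpu .
```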
QUESTION
There is at least one related question on SO that proved useful when trying to decode unicode sequences.
I am preprocessing a lot of texts from a lot of different genres. Some are economic, some are technical, and so on. One of the caveats is converting literal unicode escape sequences (e.g. a \u00e9 left in the text) into the characters they denote.
...ANSWER
Answered 2018-Sep-20 at 13:58
The raw_unicode_escape codec in the ignore error mode seems to do the trick. I'm inlining the input as a raw byte longstring here, which should, by my reasoning, be equivalent to reading it from a binary file.
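A minimal runnable sketch of that approach (the sample bytes are hypothetical, since the question's actual input was truncated in this excerpt):

```python
# Inline the input as a raw bytes literal, standing in for data read from a
# binary file; the r prefix keeps the backslash in \u00e9 literal.
raw = rb"Der caf\u00e9-Besitzer"

# raw_unicode_escape turns literal \uXXXX sequences into the characters they
# denote; errors="ignore" drops malformed escapes instead of raising.
decoded = raw.decode("raw_unicode_escape", errors="ignore")
print(decoded)  # -> Der café-Besitzer
```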
QUESTION
I am using fast_align https://github.com/clab/fast_align to get word alignments between 1000 German sentences and 1000 English translations of those sentences. So far the quality is not so good.
Would throwing more sentences into the process help fast_align be more accurate? Say I take some OPUS data with 100k aligned sentence pairs, add my 1000 sentences at the end of it, and feed it to fast_align. Will that help? I can't seem to find any info on whether this would make sense.
...ANSWER
Answered 2017-Aug-08 at 08:54
[Disclaimer: I know next to nothing about alignment and have not used fast_align.]
Yes.
You can prove this to yourself, and also plot the accuracy/scale curve, by removing data from your dataset to try it at an even lower scale.
That said, 1000 is already absurdly low; for these purposes 1000 ≈ 0, and I would not expect it to work.
Better would be to try 10K, 100K and 1M. More comparable to others' results would be a standard corpus, e.g. Wikipedia or data from the research workshops.
Adding data very different from the data that is important to you can have mixed results, but in this case more data can hardly hurt. We could be more helpful with suggestions if you mentioned a specific domain, dataset or goal.
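A sketch of the concatenation idea from the question, assuming fast_align's usual input format of one "source ||| target" pair per line (file names are hypothetical; -d, -o and -v are the options the fast_align README recommends):

```sh
# Append your 1000 pairs after the larger OPUS corpus.
cat opus.de-en mine.de-en > combined.de-en

# Train on the combined file; alignments are printed one line per input pair.
./fast_align -i combined.de-en -d -o -v > combined.align

# Output lines stay in input order, so the last 1000 are your sentences.
tail -n 1000 combined.align > mine.align
```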
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install fast_align
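The repository builds with CMake; a minimal sketch of the usual out-of-source build, following the project README:

```sh
git clone https://github.com/clab/fast_align.git
cd fast_align
mkdir build && cd build
cmake ..
make
# Produces the fast_align and atools binaries in the build directory.
```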