fast_align | Simple, fast unsupervised word aligner | Natural Language Processing library
kandi X-RAY | fast_align Summary
Simple, fast unsupervised word aligner
Community Discussions
Trending Discussions on fast_align
QUESTION
I'm using the alignment toolkit fast_align: https://github.com/clab/fast_align, to get word-to-word alignment of a parallel corpus. There is an option to print out the alignment score -- how do I interpret this score? Does the score measure the degree of alignment between the parallel sentences? I know that some of the sentences in the corpus are well aligned and others are not, but so far I see no correlation between the score and how well aligned they are. Should I adjust for the number of words in the sentence?
...ANSWER
Answered 2019-Oct-09 at 08:31
FastAlign is an implementation of IBM Model 2; the score is the probability estimated by this model. The details of the model are explained very nicely in these slides from JHU.
The score is a probability of the source sentence given the target sentence words and the alignment. The algorithm iteratively estimates:
- The word-to-word translation probabilities for (virtually all) pairs of source-language and target-language words.
- The optimal alignment given those word-to-word translation probabilities.
The score is then the product of the word-to-word translation probabilities under the alignment the algorithm converged to. So, in theory, this should correlate with how parallel the sentences are, but there are many ways in which this can break. For instance, rare words have unreliable probability estimates. Another problem is that some words (such as "of") can be part of multi-word expressions that correspond to a single word in other languages, which skews the probability estimates as well. So it is no wonder that the probability is not to be trusted.
If your goal is to filter the parallel corpus and remove incorrectly aligned sentence pairs, I would recommend something else. You can, e.g., use Multilingual BERT as they did in a paper by Google, where they use centered vectors for cross-lingual retrieval. Or just google "parallel corpus filtering."
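On adjusting for sentence length: since the score is a product of per-word probabilities, it shrinks with every extra token, so raw scores from sentences of different lengths are not comparable. A common heuristic is to compare per-token log-probabilities instead. A minimal sketch, assuming hypothetical file names and that the dumped scores are log-probabilities (take math.log first if yours are raw probabilities):

```python
import math  # only needed if your scores are raw probabilities

# Hypothetical inputs: one fast_align score per line in scores.txt and the
# corresponding target sentences, one per line, in target.en.
with open("scores.txt") as f_scores, open("target.en") as f_sents:
    for score_line, sent in zip(f_scores, f_sents):
        logp = float(score_line)
        n_tokens = max(len(sent.split()), 1)
        # Per-token log-probability is comparable across sentence lengths.
        print(logp / n_tokens)
```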
QUESTION
How to docker build from a Dockerfile with more memory?
This is a different question from Allow more memory when docker build a Dockerfile.
When installing the software natively, there is enough memory to successfully build and install the marian tool. But when building the Docker image using the Dockerfile https://github.com/marian-nmt/marian/blob/master/scripts/docker/Dockerfile.cpu, it fails with multiple memory-exhausted errors.
...ANSWER
Answered 2019-Aug-29 at 19:01
It is not something about order. The Dockerfile must be specified with -f.
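A sketch of what that invocation looks like (the image tag is an example, not from the answer; note that docker build's --memory flag applies to the classic builder and is ignored by BuildKit, and on Docker Desktop it is usually the VM's memory allowance in the settings that actually needs raising):

```sh
# Point docker at the CPU Dockerfile explicitly with -f, building from the
# repository root so the build context is correct.
docker build -f scripts/docker/Dockerfile.cpu -t marian-cpu .

# With an explicit memory allowance for the build containers
# (classic builder only; 8g is an arbitrary example value).
docker build -f scripts/docker/Dockerfile.cpu --memory=8g -t marian-cpu .
```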
QUESTION
There is at least one related question on SO that proved useful when trying to decode unicode sequences.
I am preprocessing a lot of texts from a lot of different genres. Some are economic, some are technical, and so on. One of the caveats is converting literal unicode escape sequences (e.g. a \u00e9 left in the text) into the characters they denote.
...ANSWER
Answered 2018-Sep-20 at 13:58
The raw_unicode_escape codec in the ignore error mode seems to do the trick. I'm inlining the input as a raw byte longstring here, which should, by my reasoning, be equivalent to reading it from a binary file.
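A minimal runnable sketch of that approach (the sample bytes are hypothetical, since the question's actual input was truncated in this excerpt):

```python
# Inline the input as a raw bytes literal, standing in for data read from a
# binary file; the r prefix keeps the backslash in \u00e9 literal.
raw = rb"Der caf\u00e9-Besitzer"

# raw_unicode_escape turns literal \uXXXX sequences into the characters they
# denote; errors="ignore" drops malformed escapes instead of raising.
decoded = raw.decode("raw_unicode_escape", errors="ignore")
print(decoded)  # -> Der café-Besitzer
```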
QUESTION
I am using fast_align https://github.com/clab/fast_align to get word alignments between 1000 German sentences and 1000 English translations of those sentences. So far the quality is not so good.
Would throwing more sentences into the process help fast_align be more accurate? Say I take some OPUS data with 100k aligned sentence pairs, add my 1000 sentences at the end of it, and feed it to fast_align. Will that help? I can't seem to find any info on whether this would make sense.
...ANSWER
Answered 2017-Aug-08 at 08:54
[Disclaimer: I know next to nothing about alignment and have not used fast_align.]
Yes.
You can prove this to yourself, and also plot the accuracy/scale curve, by removing data from your dataset to try it at an even lower scale.
That said, 1000 is already absurdly low; for these purposes 1000 ≈ 0, and I would not expect it to work.
Better would be to try 10K, 100K and 1M. More comparable to others' results would be a standard corpus, e.g. Wikipedia or data from the research workshops.
Adding data very different from the data that is important to you can have mixed results, but in this case more data can hardly hurt. We could be more helpful with suggestions if you mentioned a specific domain, dataset or goal.
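A sketch of the concatenation idea from the question, assuming fast_align's usual input format of one "source ||| target" pair per line (file names are hypothetical; -d, -o and -v are the options the fast_align README recommends):

```sh
# Append your 1000 pairs after the larger OPUS corpus.
cat opus.de-en mine.de-en > combined.de-en

# Train on the combined file; alignments are printed one line per input pair.
./fast_align -i combined.de-en -d -o -v > combined.align

# Output lines stay in input order, so the last 1000 are your sentences.
tail -n 1000 combined.align > mine.align
```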
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install fast_align
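The repository builds with CMake; a minimal sketch of the usual out-of-source build, following the project README:

```sh
git clone https://github.com/clab/fast_align.git
cd fast_align
mkdir build && cd build
cmake ..
make
# Produces the fast_align and atools binaries in the build directory.
```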