lang-detect | detecting the language for a small piece of unicode text | Data Manipulation library
kandi X-RAY | lang-detect Summary
detecting the language for a small piece of unicode text
Support
Quality
Security
License
Reuse
Top functions reviewed by kandi - BETA
- Detect the similarity of the text.
- Parse command-line arguments.
- Return the next gram in the text.
- Find the first occurrence of c.
- Return the content of a given URL.
- Return the number of occurrences of c.
- Compute the inner product of a and b.
- Initialize the grammar.
- Return an iterator.
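The reviewed functions above suggest a classic n-gram profile approach: split the text into character grams, count occurrences, and compare profiles via an inner product. A minimal sketch of that idea in Python (all names here are hypothetical illustrations, not lang-detect's actual API):

```python
from collections import Counter
from math import sqrt

def ngrams(text, n=3):
    """Yield successive character n-grams of the text."""
    for i in range(len(text) - n + 1):
        yield text[i:i + n]

def profile(text, n=3):
    """Return a frequency vector (Counter) of character n-grams."""
    return Counter(ngrams(text.lower(), n))

def cosine_similarity(a, b):
    """Normalised inner product of two n-gram profiles, in [0, 1]."""
    inner = sum(a[g] * b[g] for g in a if g in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return inner / norm if norm else 0.0
```

Detection then amounts to comparing an input profile against precomputed profiles for each language and picking the most similar one.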
lang-detect Key Features
lang-detect Examples and Code Snippets
Community Discussions
Trending Discussions on lang-detect
QUESTION
Currently I am working on a project using NLP and Python. I have some text content and need to find its language. I am using spaCy to detect the language, but the library only reports the language as English. I need to find whether it is British or American English. Any suggestions?
I tried spaCy, NLTK, and lang-detect, but these libraries only report English. I need to display en-GB for British and en-US for American English.
ANSWER
Answered 2019-Oct-01 at 09:41
You can train your own model. The University of Leipzig has collected many geographically specific corpora of English, but they do not include US English. The American National Corpus should be a free subset that you can use.
A popular library for language identification, langid.py, allows training your own model; they have a nice tutorial on GitHub. Their model is based on character tri-gram frequencies, which might not be a sufficiently distinctive statistic in this case.
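As the answer notes, character tri-grams may not separate dialects well; spelling differences (colour/color, -ise/-ize) are far more distinctive. A toy sketch of a word-count classifier built on that assumption (the miniature corpora below are invented purely for illustration; a real model needs proper training data):

```python
from collections import Counter

# Invented miniature "corpora" for illustration only.
GB = "the colour of the programme at the centre was a favourite to analyse"
US = "the color of the program at the center was a favorite to analyze"

def word_counts(text):
    """Frequency table of whitespace-separated words."""
    return Counter(text.lower().split())

def score(text, ref_counts):
    """Sum of reference-corpus counts over the words of `text`."""
    return sum(ref_counts[w] for w in text.lower().split())

def detect_dialect(text):
    """Label text en-GB or en-US by which toy corpus it matches better."""
    gb, us = word_counts(GB), word_counts(US)
    return "en-GB" if score(text, gb) >= score(text, us) else "en-US"
```

With realistic corpora the same idea scales up to a Naive Bayes or logistic-regression classifier over spelling-variant features.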
Another option is to train a classifier on top of BERT using, e.g., PyTorch and the transformers library. This will surely get very good results, but if you are not experienced with deep learning, it might actually be a lot of work for you.
QUESTION
I have been working on my project, Deep Learning Language Detection, which is a network with these layers to recognise 16 programming languages:
And this is the code to produce the network:
ANSWER
Answered 2017-Nov-03 at 09:21
TL;DR: The problem is that your data are not shuffled before being split into training and validation sets. Therefore, during training, all samples belonging to class "sql" are in the validation set. Your model won't learn to predict the last class if it has never been given samples from that class.
In get_input_and_labels(), the files for class 0 are loaded first, then class 1, and so on. Since you set n_max_files = 2000, this means that:
- the first 2000 (or so, depending on how many files you actually have) entries in Y will be of class 0 ("go"),
- the next 2000 entries will be of class 1 ("csharp"),
- ...
- and the last 2000 entries will be of the last class ("sql").
Unfortunately, Keras does not shuffle the data before splitting them into training and validation sets. Because validation_split is set to 0.1 in your code, about the last 3000 samples (which contain all the "sql" samples) will be in the validation set.
If you set validation_split to a higher value (e.g., 0.2), you'll see more classes scoring 0%.
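A minimal way to avoid the problem is to shuffle features and labels with the same permutation before calling fit. A sketch (`X` and `Y` stand in for the arrays built by the question's code):

```python
import random

def shuffled_together(X, Y, seed=0):
    """Shuffle features and labels with one shared permutation, so that
    the tail-end slice taken by Keras's validation_split sees every class."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    return [X[i] for i in idx], [Y[i] for i in idx]

# X, Y = shuffled_together(X, Y)
# model.fit(X, Y, validation_split=0.1, ...)
```

Note that passing shuffle=True to fit does not fix this: Keras takes the validation_split slice from the end of the data before any shuffling is applied.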
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install lang-detect
You can use lang-detect like any standard Python library. You will need a development environment consisting of a Python distribution (including header files), a compiler, pip, and git. Make sure that your pip, setuptools, and wheel are up to date. When using pip, it is generally recommended to install packages in a virtual environment to avoid changes to the system.
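The steps above can be followed as below on a Unix-like shell (a sketch; the PyPI package name is assumed to match the project name, so install from source if it differs):

```shell
# Create and activate an isolated environment.
python3 -m venv .venv
. .venv/bin/activate

# Bring the packaging tooling up to date, then install.
pip install --upgrade pip setuptools wheel
pip install lang-detect   # assumed PyPI name; install from the repo if it differs
```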