
nltk | NLTK the Natural Language Toolkit | Natural Language Processing library

by nltk | Python Version: Current | License: Apache-2.0



kandi X-RAY | nltk Summary

nltk is a Python library typically used in Artificial Intelligence and Natural Language Processing applications. nltk has no bugs, has a build file available, has a permissive license, and has medium support. However, nltk has 4 reported vulnerabilities. You can download it from GitHub.
NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets, and tutorials supporting research and development in Natural Language Processing. NLTK requires Python version 3.7, 3.8, 3.9 or 3.10. For documentation, please visit nltk.org.

Support

  • nltk has a medium active ecosystem.
  • It has 10,427 stars, 2,545 forks, and 472 watchers.
  • It had no major release in the last 12 months.
  • There are 203 open issues and 1,378 closed issues. On average, issues are closed in 130 days. There are 8 open pull requests and 0 closed pull requests.
  • It has a neutral sentiment in the developer community.
  • The latest version of nltk is current.

Quality

  • nltk has no bugs reported.

Security

  • nltk has 4 vulnerability issues reported (0 critical, 4 high, 0 medium, 0 low).

License

  • nltk is licensed under the Apache-2.0 License. This license is Permissive.
  • Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

  • nltk releases are not available. You will need to build from source code and install.
  • Build file is available. You can build the component from source.
  • Installation instructions are not available. Examples and code snippets are available.
Top functions reviewed by kandi - BETA

kandi has reviewed nltk and identified the functions below as its top functions. This is intended to give you an instant insight into the functionality nltk implements, and to help you decide whether it suits your requirements.

  • Train the model.
  • Process relations.
  • Generate node coordinates for a node.
  • Perform a postag regression on the model.
  • Create a LU for the given function.
  • Return a list of words.
  • Compute the BLEU score (see the sketch after this list).
  • Train a hidden Markov model.
  • Run an example demo.
  • Find a jar file for the given name pattern.
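
As a quick illustration of one of these entry points, here is a minimal, hedged sketch of NLTK's sentence-level BLEU scorer; the reference and hypothesis sentences are illustrative assumptions, not data from the library.

# Minimal sketch of sentence-level BLEU with NLTK (toy sentences are assumptions).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]   # list of reference token lists
hypothesis = ["the", "cat", "is", "on", "the", "mat"]

# Smoothing avoids zero scores when some higher-order n-grams have no overlap.
score = sentence_bleu(reference, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))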

nltk Key Features

NLTK Source

Citing

Bird, Steven, Edward Loper and Ewan Klein (2009).
Natural Language Processing with Python.  O'Reilly Media Inc.

Pandas - Keyword count by Category

df["Text"] = (
    df["Text"]
    .str.lower()
    .replace([r'\|', RE_stopwords], [' ', ''], regex=True)
    .str.strip()
    # .str.cat(sep=' ')
    .str.split()  # Previously .split()
)
  Category          Text
0      Red        [good]
1      Red        [good]
2     Blue  [dont, like]
3   Yellow        [stop]
4     Blue  [dont, like]
df.explode("Text").groupby(["Category", "Text"]).size()
Category  Text
Blue      dont    2
          like    2
Red       good    2
Yellow    stop    1

Import numpy can't be resolved ERROR When I already have numpy installed

pip install virtualenv
mkdir python-virtual-environments && cd python-virtual-environments
python3 -m venv env
source env/bin/activate    # activate venv
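
Once the virtual environment is active, a short check like the sketch below confirms that numpy resolves from the interpreter inside the venv (this assumes you have run "pip install numpy" in the activated environment; the printed version depends on what you installed).

# Run inside the activated venv after installing numpy there.
import sys
import numpy  # should now resolve, since it is installed into the active venv

print(sys.executable)      # shows which interpreter (and therefore which venv) is in use
print(numpy.__version__)   # confirms the import works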

How to Capitalize Locations in a List Python

from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."

# Tokenize str -> List[str]
tok_sent = word_tokenize(sentence)
# Tag List[str] -> List[Tuple[str, str]]
pos_sent = pos_tag(tok_sent)
print(pos_sent)
# Chunk this tagged data
tree_sent = ne_chunk(pos_sent)
# This returns a Tree, which we pretty-print
tree_sent.pprint()

locations = []
# All subtrees at height 2 will be our named entities
for named_entity in tree_sent.subtrees(lambda t: t.height() == 2):
    # Extract named entity type and the chunk
    ne_type = named_entity.label()
    chunk = " ".join([tagged[0] for tagged in named_entity.leaves()])
    print(ne_type, chunk)
    if ne_type == "GPE":
        locations.append(chunk)

print(locations)
# pos_tag output:
[('In', 'IN'), ('the', 'DT'), ('wake', 'NN'), ('of', 'IN'), ('a', 'DT'), ('string', 'NN'), ('of', 'IN'), ('abuses', 'NNS'), ('by', 'IN'), ('New', 'NNP'), ('York', 'NNP'), ('police', 'NN'), ('officers', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('1990s', 'CD'), (',', ','), ('Loretta', 'NNP'), ('E.', 'NNP'), ('Lynch', 'NNP'), (',', ','), ('the', 'DT'), ('top', 'JJ'), ('federal', 'JJ'), ('prosecutor', 'NN'), ('in', 'IN'), ('Brooklyn', 'NNP'), (',', ','), ('spoke', 'VBD'), ('forcefully', 'RB'), ('about', 'IN'), ('the', 'DT'), ('pain', 'NN'), ('of', 'IN'), ('a', 'DT'), ('broken', 'JJ'), ('trust', 'NN'), ('that', 'IN'), ('African-Americans', 'NNP'), ('felt', 'VBD'), ('and', 'CC'), ('said', 'VBD'), ('the', 'DT'), ('responsibility', 'NN'), ('for', 'IN'), ('repairing', 'VBG'), ('generations', 'NNS'), ('of', 'IN'), ('miscommunication', 'NN'), ('and', 'CC'), ('mistrust', 'NN'), ('fell', 'VBD'), ('to', 'TO'), ('law', 'NN'), ('enforcement', 'NN'), ('.', '.')]
# ne_chunk output:
(S
  In/IN
  the/DT
  wake/NN
  of/IN
  a/DT
  string/NN
  of/IN
  abuses/NNS
  by/IN
  (GPE New/NNP York/NNP)
  police/NN
  officers/NNS
  in/IN
  the/DT
  1990s/CD
  ,/,
  (PERSON Loretta/NNP E./NNP Lynch/NNP)
  ,/,
  the/DT
  top/JJ
  federal/JJ
  prosecutor/NN
  in/IN
  (GPE Brooklyn/NNP)
  ,/,
  spoke/VBD
  forcefully/RB
  about/IN
  the/DT
  pain/NN
  of/IN
  a/DT
  broken/JJ
  trust/NN
  that/IN
  African-Americans/NNP
  felt/VBD
  and/CC
  said/VBD
  the/DT
  responsibility/NN
  for/IN
  repairing/VBG
  generations/NNS
  of/IN
  miscommunication/NN
  and/CC
  mistrust/NN
  fell/VBD
  to/TO
  law/NN
  enforcement/NN
  ./.)
# All entities found
GPE New York
PERSON Loretta E. Lynch
GPE Brooklyn
# All GPE (Geo-Political Entity)
['New York', 'Brooklyn']
import spacy
import en_core_web_sm
from pprint import pprint

sentence = "In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
nlp = en_core_web_sm.load()

doc = nlp(sentence)
pprint([(X.text, X.label_) for X in doc.ents])
# Then, we can take only `GPE`:
print([X.text for X in doc.ents if X.label_ == "GPE"])
[('New York', 'GPE'),
 ('the 1990s', 'DATE'),
 ('Loretta E. Lynch', 'PERSON'),
 ('Brooklyn', 'GPE'),
 ('African-Americans', 'NORP')]
['New York', 'Brooklyn']
[('new york', 'GPE'),
 ('the 1990s', 'DATE'),
 ('loretta e. lynch', 'PERSON'),
 ('brooklyn', 'GPE'),
 ('african-americans', 'NORP')]
['new york', 'brooklyn']

Manually install Open Multilingual Wordnet (NLTK)

nltk_data
+ corpora
  + wordnet
    + adj.exc
    + adv.exc
    + ...
  + omw
    + ...
    + ita
      + citation.bib
      + LICENSE
      + ...
    + ...
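
Once the files are laid out as above, a minimal sketch like the following should let NLTK find them; the /path/to/nltk_data location and the Italian example word are assumptions for illustration, and the exact corpus packages required can vary by NLTK version.

import nltk
from nltk.corpus import wordnet as wn

# Point NLTK at the manually created nltk_data directory (path is an assumption).
nltk.data.path.append("/path/to/nltk_data")

# "ita" is the ISO 639-3 code the Open Multilingual Wordnet uses for Italian.
print(wn.synsets("cane", lang="ita"))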

tokenize sentence into words python

s='"[\"Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/x.x" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"\"]" i want "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"'

words = s.split(' ') # break the sentence into spaces
# ['"["Jan', '31', '19:28:14', 'nginx:', '10.0.0.0', '-', '-', '[31/Jan/2019:19:28:14', '+0100]', '"POST', '/test/itf/', 'HTTP/x.x"', '404', '146', '"-"', '"Mozilla/5.2', '[en]', '(X11,', 'U;', 'OpenVAS-XX', '9.2.7)""]"', 'i', 'want', '"Mozilla/5.2', '[en]', '(X11,', 'U;', 'OpenVAS-XX', '9.2.7)"']

# then access your data list
words[0] # '"["Jan'
words[1] # '31'
words[2] # '19:28:14'
-----------------------
import re
s_list = []

def str_partition(text):
    parts = text.partition(" ")
    part = re.sub('[\[\]\"\'\-]', '', parts[0])
    
    if part.startswith("nginx"):
        s_list.append(part.replace(":", ''))
    elif part != "":
        s_list.append(part)
        
    if not parts[2].startswith('"Moz'):
        str_partition(parts[2])
    else:
        part = re.sub('[\"\']', '', parts[2])
        part = part[:-1]
        s_list.append(part)
        return

s = '[\'Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/x.x" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"\']'     
str_partition(s)       
print(s_list)
['Jan', '31', '19:28:14', 'nginx', '10.0.0.0', '31/Jan/2019:19:28:14', '+0100',
'POST', '/test/itf/', 'HTTP/x.x', '404', '146', 'Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)']
-----------------------
import re
import nltk  # needed for the word_tokenize fallback below

sentences = ['[\'Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/x.x" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"\']']

rx = re.compile(r'\b(\w{3})\s+(\d{1,2})\s+(\d{1,2}:\d{1,2}:\d{2})\s+(\w+)\W+(\d{1,3}(?:\.\d{1,3}){3})(?:\s+\S+){2}\s+\[([^][\s]+)\s+([+\d]+)]\s+"([A-Z]+)\s+(\S+)\s+(\S+)"\s+(\d+)\s+(\d+)\s+\S+\s+"([^"]*)"')

words=[]
for sent in sentences:
    m = rx.search(sent)
    if m:
        words.append(list(m.groups()))
    else:
        words.append(nltk.word_tokenize(sent))

print(words)
[['Jan', '31', '19:28:14', 'nginx', '10.0.0.0', '31/Jan/2019:19:28:14', '+0100', 'POST', '/test/itf/', 'HTTP/x.x', '404', '146', 'Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)']]

How do I turn this oddly formatted looped print function into a data frame with similar output?

L = []
for sent in nltk.sent_tokenize(sentence):
  for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
     if hasattr(chunk, 'label'):
        L.append([chunk.label(), ' '.join(c[0] for c in chunk)])
        
df = pd.DataFrame(L, columns=['a','b'])
print (df)
              a               b
0        PERSON          Martin
1        PERSON     Luther King
2        PERSON    Michael King
3  ORGANIZATION        American
4           GPE        American
5           GPE       Christian
6        PERSON  Mahatma Gandhi
7        PERSON   Martin Luther
L= [[chunk.label(), ' '.join(c[0] for c in chunk)]  
     for sent in nltk.sent_tokenize(sentence) 
     for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))) 
     if hasattr(chunk, 'label')]

df = pd.DataFrame(L, columns=['a','b'])

How to get a nested list by stemming the words inside the nested lists?

from nltk.stem import PorterStemmer

tokens = [['cooked', 'lovely','baked'],['hotel', 'going','liked'],['room','looking']]

ps = PorterStemmer()
stemmed = [[ps.stem(word) for word in sublst] for sublst in tokens]

print(stemmed)
# [['cook', 'love', 'bake'], ['hotel', 'go', 'like'], ['room', 'look']]

No module named 'nltk.lm' in Google colaboratory

!pip install -U nltk
...
Downloading nltk-3.6.5-py3-none-any.whl (1.5 MB)
...
Successfully uninstalled nltk-3.2.5
...
You must restart the runtime in order to use newly installed versions.
import nltk
print('The nltk version is {}.'.format(nltk.__version__))
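
After the runtime restart, nltk.lm should import normally. The snippet below is a minimal, illustrative sketch of training an MLE bigram model; the toy corpus and the bigram order are assumptions.

# Minimal nltk.lm sketch: train a bigram MLE model on a toy corpus.
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

corpus = [["a", "b", "b", "a"], ["a", "c", "b"]]           # toy tokenised sentences
train_data, vocab = padded_everygram_pipeline(2, corpus)    # bigram order = 2

lm = MLE(2)
lm.fit(train_data, vocab)
print(lm.counts[["a"]]["b"])   # how often "b" follows "a" in the training data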

Pyodide filesystem for NLTK resources : missing files

from js import fetch

response = await fetch("<url>")
js_buffer = await response.arrayBuffer()
py_buffer = js_buffer.to_py()  # this is a memoryview
stream = py_buffer.tobytes()  # now we have a bytes object

# that we can finally write under the appropriate path
with open("<file_path>", "wb") as fh:
    fh.write(stream)
-----------------------
from js import fetch
import nltk
from pathlib import Path
import os, sys, io, zipfile

response = await fetch('https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip')
js_buffer = await response.arrayBuffer()
py_buffer = js_buffer.to_py()  # this is a memoryview
stream = py_buffer.tobytes()  # now we have a bytes object

d = Path("/nltk_data/tokenizers")
d.mkdir(parents=True, exist_ok=True)

Path('/nltk_data/tokenizers/punkt.zip').write_bytes(stream)

# extract punkt.zip
zipfile.ZipFile('/nltk_data/tokenizers/punkt.zip').extractall(
    path='/nltk_data/tokenizers/'
)

# check file contents in /nltk_data/tokenizers/
# print(os.listdir("/nltk_data/tokenizers/punkt"))

nltk.word_tokenize("some text here")

ModuleNotFoundError: No module named '_tkinter' on Jupyter Notebook

 $which python

-> alias python='/usr/bin/python3.7'
    /usr/bin/python3.7
$jupyter kernelspec list
-> Available kernels:
    python3    /home/Natko/.local/share/jupyter/kernels/python3
$nano  /home/Natko/.local/share/jupyter/kernels/python3/kernel.json 
{
 "argv": [
  "python",
  "-m",
  "ipykernel_launcher",
  "-f",
  "{connection_file}"
 ],
 "display_name": "Python 3",
 "language": "python",
}
{
 "argv": [
  "python",
  "-m",
  "ipykernel_launcher",
  "-f",
  "{connection_file}"
 ],
 "display_name": "Python 3",
 "language": "python",
 "env": {
     "PYTHONPATH": "/usr/bin/"
 }
}

Community Discussions

Trending Discussions on nltk
  • Pandas - Keyword count by Category
  • Import numpy can't be resolved ERROR When I already have numpy installed
  • How to Capitalize Locations in a List Python
  • Manually install Open Multilingual Wordnet (NLTK)
  • tokenize sentence into words python
  • Convert words between part of speech, when wordnet doesn't do it
  • How do I turn this oddly formatted looped print function into a data frame with similar output?
  • Sagemaker Serverless Inference & custom container: Model archiver subprocess fails
  • How to get a nested list by stemming the words inside the nested lists?
  • No module named 'nltk.lm' in Google colaboratory

QUESTION

Pandas - Keyword count by Category

Asked 2022-Apr-04 at 13:41

I am trying to get a count of the most frequently occurring words in my df, grouped by another column's values:

I have a dataframe like so:

df=pd.DataFrame({'Category':['Red','Red','Blue','Yellow','Blue'],'Text':['this is very good ','good','dont like','stop','dont like']})


This is the way that I have counted the keywords in the Text column:

from collections import Counter

top_N = 100


stopwords = nltk.corpus.stopwords.words('english')
# # RegEx for stopwords
RE_stopwords = r'\b(?:{})\b'.format('|'.join(stopwords))
# replace '|'-->' ' and drop all stopwords
words = (df.Text
           .str.lower()
           .replace([r'\|', RE_stopwords], [' ', ''], regex=True)
           .str.cat(sep=' ')
           .split()
)

# generate DF out of Counter
df_top_words = pd.DataFrame(Counter(words).most_common(top_N),
                    columns=['Word', 'Frequency']).set_index('Word')
print(df_top_words)

Which produces a single frequency table for the whole data frame.

However, this just generates a list of all of the words in the data frame; what I am after is the keyword count broken down by Category.

ANSWER

Answered 2022-Apr-04 at 13:11

Your words statement finds the words that you care about (removing stopwords) in the text of the whole column. We can change that a bit to apply the replacement on each row instead:

df["Text"] = (
    df["Text"]
    .str.lower()
    .replace([r'\|', RE_stopwords], [' ', ''], regex=True)
    .str.strip()
    # .str.cat(sep=' ')
    .str.split()  # Previously .split()
)

Resulting in:

  Category          Text
0      Red        [good]
1      Red        [good]
2     Blue  [dont, like]
3   Yellow        [stop]
4     Blue  [dont, like]

Now, we can use .explode and then .groupby and .size to expand each list element to its own row and then count how many times does a word appear in the text of each (original) row:

df.explode("Text").groupby(["Category", "Text"]).size()

Resulting in:

Category  Text
Blue      dont    2
          like    2
Red       good    2
Yellow    stop    1

Now, this does not match your output sample because in that sample you're not applying the .replace step from the original words statement (now used to calculate the new value of the "Text" column). If you wanted that result, you just have to comment out that .replace line (but I guess that's the whole point of this question)
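
For reference, here is a self-contained sketch that puts the question's setup and this answer together; the toy data frame comes from the question, and downloading the stopwords corpus is assumed to be needed on a fresh environment.

import pandas as pd
import nltk

nltk.download("stopwords", quiet=True)   # one-time download on a fresh environment

df = pd.DataFrame({
    "Category": ["Red", "Red", "Blue", "Yellow", "Blue"],
    "Text": ["this is very good ", "good", "dont like", "stop", "dont like"],
})

stopwords = nltk.corpus.stopwords.words("english")
RE_stopwords = r"\b(?:{})\b".format("|".join(stopwords))

# Clean each row in place, then count words per Category.
df["Text"] = (
    df["Text"]
    .str.lower()
    .replace([r"\|", RE_stopwords], [" ", ""], regex=True)
    .str.strip()
    .str.split()
)
print(df.explode("Text").groupby(["Category", "Text"]).size())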

Source https://stackoverflow.com/questions/71737328

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install nltk

You can download it from GitHub.
You can use nltk like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.
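
As a quick sanity check after installing (for example with "python -m pip install nltk" inside a virtual environment), a sketch like the following should print the installed version and tokenize a sentence; the example sentence is an assumption, and the tokenizer data download is only needed once.

import nltk

print(nltk.__version__)             # confirm the install resolves
# Tokenizer models used by word_tokenize; newer NLTK versions may ask for "punkt_tab" instead.
nltk.download("punkt", quiet=True)
print(nltk.word_tokenize("NLTK is installed and working."))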

Support

Do you want to contribute to NLTK development? Great! Please read CONTRIBUTING.md for more details. See also how to contribute to NLTK.
