How to Process Tokenization of Hyphenated Words

by vigneshchennai74 Updated: Jan 31, 2023

Solution Kit

Hyphenated words are words whose parts are joined by a hyphen (-), such as "up-scaled". The hyphen usually joins two or more commonly used words into a single compound.

Tokenization is the process of breaking a piece of text into smaller units called tokens. Tokens are the basic building blocks of a text; they can be words, phrases, sentences, or even individual characters, depending on the task and the level of granularity required. Tokenizing hyphenated words can be tricky because the hyphen can mean different things depending on the context and the language. There are several ways to handle hyphenated words during tokenization, and the best one depends on the specific task and the desired level of granularity:

Treat the entire word as a single token: This method keeps the whole word, including the hyphen, as one token ("up-scaled").
Treat the parts as separate tokens: This method splits the word at the hyphen into one token per part and drops the hyphen ("up", "scaled").
Treat the hyphen as a separate token: This method splits the word at the hyphen and keeps the hyphen as a token of its own ("up", "-", "scaled").
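
The three strategies above can be compared with a minimal sketch that uses only Python's standard re module. It is not the spaCy solution shown later; the function names are hypothetical, and the patterns are deliberately simplified (they ignore punctuation and only handle the plain ASCII hyphen).

```python
import re

def hyphen_as_single_token(text):
    # A word is a run of word characters, optionally followed by "-part" groups.
    return re.findall(r"\w+(?:-\w+)*", text)

def hyphen_parts_only(text):
    # Every run of word characters is a token; hyphens are dropped.
    return re.findall(r"\w+", text)

def hyphen_as_separate_token(text):
    # Runs of word characters and single hyphens are both tokens.
    return re.findall(r"\w+|-", text)

sample = "an up-scaled image"
print(hyphen_as_single_token(sample))    # ['an', 'up-scaled', 'image']
print(hyphen_parts_only(sample))         # ['an', 'up', 'scaled', 'image']
print(hyphen_as_separate_token(sample))  # ['an', 'up', '-', 'scaled', 'image']
```

Which output is preferable depends on the task: keeping "up-scaled" whole preserves the compound, while splitting it exposes the individual parts.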

The code below shows one way to tokenize hyphenated words so that they stay intact.


Code

In this solution, we use the Tokenizer class of spaCy.

spaCy - Tokenization of Hyphenated words

Python | Lines of Code: 37 | License: Strong Copyleft (CC BY-SA 4.0)


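import spacy
from spacy.tokenizer import Tokenizer
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

def hyphen_preserving_infix_re():
    # Infix patterns based on spaCy's defaults, without the rule that
    # splits on hyphens between letters.
    infixes = (
        LIST_ELLIPSES
        + LIST_ICONS
        + [
            r"(?<=[0-9])[+\-\*^](?=[0-9-])",
            r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
                al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
            ),
            r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
            # Disabled so that hyphenated words stay whole:
            # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
            r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
        ]
    )
    return compile_infix_regex(infixes)

def custom_tokenizer(nlp):
    # Rebuild the tokenizer, reusing the model's prefix, suffix, and
    # exception rules and replacing only the infix rules.
    infix_re = hyphen_preserving_infix_re()
    return Tokenizer(nlp.vocab, prefix_search=nlp.tokenizer.prefix_search,
                                suffix_search=nlp.tokenizer.suffix_search,
                                infix_finditer=infix_re.finditer,
                                token_match=nlp.tokenizer.token_match,
                                rules=nlp.Defaults.tokenizer_exceptions)


# Option 1: replace the tokenizer with the custom one.
# On spaCy versions earlier than 3.0, use spacy.load("en") instead.
nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)
print([t.text for t in nlp("It's '1.50', up-scaled haven't")])
# ['It', "'s", "'", '1.50', "'", ',', 'up-scaled', 'have', "n't"]

# Option 2: keep the default tokenizer and replace only its infix rules.
nlp = spacy.load("en_core_web_sm")
nlp.tokenizer.infix_finditer = hyphen_preserving_infix_re().finditer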

Copy the code above and paste it into a Python file in your IDE.
Replace the sample sentence with the text you want to tokenize.
Run the file to tokenize the hyphenated words.
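
To see why leaving out the hyphen rule keeps "up-scaled" in one piece, the following sketch imitates infix splitting with Python's standard re module. It is an illustration under simplifying assumptions, not spaCy's actual implementation: LETTER_HYPHEN is a reduced ASCII-only stand-in for the rule built from ALPHA and HYPHENS, and split_on_infixes is a hypothetical helper.

```python
import re

# Simplified stand-in for the rule that splits on hyphens between letters.
LETTER_HYPHEN = re.compile(r"(?<=[A-Za-z])-(?=[A-Za-z])")
# The numeric operator rule, copied from the snippet above.
NUMERIC_OPERATOR = re.compile(r"(?<=[0-9])[+\-\*^](?=[0-9-])")

def split_on_infixes(word, patterns):
    """Split word at every infix match, keeping each matched infix as a token."""
    matches = sorted(
        (m for p in patterns for m in p.finditer(word)),
        key=lambda m: m.start(),
    )
    tokens, start = [], 0
    for m in matches:
        if m.start() < start:
            continue  # skip a match that overlaps one already consumed
        if m.start() > start:
            tokens.append(word[start:m.start()])
        tokens.append(m.group())
        start = m.end()
    if start < len(word):
        tokens.append(word[start:])
    return tokens

# With the letter-hyphen rule active, the word is split at the hyphen.
print(split_on_infixes("up-scaled", [LETTER_HYPHEN, NUMERIC_OPERATOR]))
# ['up', '-', 'scaled']

# Without it, no pattern matches and the word stays whole.
print(split_on_infixes("up-scaled", [NUMERIC_OPERATOR]))
# ['up-scaled']

# The numeric rule still splits operators between digits.
print(split_on_infixes("10-5", [NUMERIC_OPERATOR]))
# ['10', '-', '5']
```

The numeric rule is untouched, so expressions such as "10-5" are still split even though hyphenated words are not.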

I hope you found this useful. The following sections list the dependent library and the versions this solution was tested with.

I found this code snippet by searching for "Tokenization of Hyphenated Words" in kandi. You can try any such use case!

Note

This snippet uses the en_core_web_sm language model.

Download the model by running python -m spacy download en_core_web_sm in your terminal.

Check your spaCy version by running pip show spacy in your terminal.

If the version is 3.0 or later, load the model with nlp = spacy.load("en_core_web_sm").
If the version is earlier than 3.0, load the model with nlp = spacy.load("en").
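
The version rule above can be written as a small helper. This is a sketch only: model_name_for is a hypothetical function, not part of spaCy, and it assumes the version string starts with the major version number, as in "3.4.3".

```python
def model_name_for(spacy_version):
    # Pick the model name to pass to spacy.load() based on the major version.
    major = int(spacy_version.split(".")[0])
    return "en_core_web_sm" if major >= 3 else "en"

print(model_name_for("3.4.3"))  # en_core_web_sm
print(model_name_for("2.3.9"))  # en
```

In a script, the result could be used as spacy.load(model_name_for(spacy.__version__)), provided the corresponding model is installed.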

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.

The solution was created with Python 3.7.15.
The solution was tested with spaCy 3.4.3.

Using this solution, we can tokenize hyphenated words in Python in a few simple steps. It also provides an easy way to build a working, hands-on version of the code that tokenizes text in Python.

Dependent Library

spaCy by explosion

Python | 26383 | Version: v3.2.6 | License: Permissive (MIT)

💫 Industrial-strength Natural Language Processing (NLP) in Python


If you do not have spaCy, which is required to run this code, you can install it with pip install spacy or copy the pip install command from the spaCy page in kandi.

You can search kandi for any dependent library, such as spaCy.

Support

For any support on kandi solution kits, please use the chat
For further learning resources, visit the Open Weaver Community learning page
