Explore all natural language processing open source software, libraries, packages, source code, cloud functions and APIs.

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The result is a computer capable of 'understanding' the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Popular New Releases in Natural Language Processing

transformers

v4.18.0: Checkpoint sharding, vision models

HanLP

v1.8.2: Routine maintenance and accuracy improvements

spaCy

v3.1.6: Workaround for Click/Typer issues

flair

Release 0.11

allennlp

v2.9.2

Popular Libraries in Natural Language Processing

transformers

by huggingface · Python

61400 stars · Apache-2.0

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

funNLP

by fighting41love · Python

33333 stars

An almost encyclopedic collection of Chinese NLP resources (with some English material): sensitive-word lists for Chinese and English, language detection, phone/ID-card/email extraction, name-based gender inference, Chinese and Japanese name dictionaries, abbreviation and character-decomposition dictionaries, sentiment lexicons, stop words, simplified/traditional conversion, domain lexicons (IT, finance, law, medicine, food, automotive, place names, idioms, historical figures, and more), Chinese word vectors, chat and rumor corpora, QA datasets (Baidu Zhidao and others), sentence-similarity algorithms, BERT resources, text generation and summarization tools, information-extraction toolkits (cocoNLP, pke), knowledge graphs (Tsinghua XLORE plus medical, financial, military, legal, and securities KGs), dialogue systems and chatbots, OCR and speech-recognition corpora and tools, NER and relation-extraction models, annotation tools (brat, doccano, Poplar, LIDA), pre-trained language model collections (OpenCLaP, UER, ALBERT, ELECTRA, GPT-2, XLM), data-augmentation tools for Chinese and English, competition solution write-ups, courses and tutorials (cs224n and others), English spell-checking libraries, full-text search engines (wwsearch), news-recommendation architectures (CHAMELEON), tokenizers, text-classification toolkits (NeuralClassifier, PySS3), entity-linking libraries (BLINK), punctuation restoration (BertPunc), multilingual speech-translation corpora (CoVoST), question-answering frameworks (Haystack), Chinese keyphrase extraction, and many more datasets, lexicons, and tools.

bert

by google-research · Python

28940 stars · Apache-2.0

TensorFlow code and pre-trained models for BERT

jieba

by fxsjy · Python

26924 stars · MIT

Jieba Chinese word segmentation

Python

by geekcomputers · Python

23653 stars · MIT

My Python Examples

HanLP

by hankcs · Python

23581 stars · Apache-2.0

Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, semantic dependency parsing, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin conversion, simplified/traditional conversion, natural language processing

spaCy

by explosion · Python

23063 stars · MIT

💫 Industrial-strength Natural Language Processing (NLP) in Python

fastText

by facebookresearch · HTML

22903 stars · MIT

Library for fast text representation and classification.

NLP-progress

by sebastianruder · Python

18988 stars · MIT

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Trending New libraries in Natural Language Processing

PaddleNLP

by PaddlePaddle · Python

3119 stars · Apache-2.0

Easy-to-use and Fast NLP library with awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications.

NLP_ability

by DA-southampton · Python

2488 stars

A summary of the knowledge an NLP engineer needs to build up, including interview questions, fundamentals, and engineering skills, to strengthen core competitiveness.

texthero

by jbesomi · Python

2212 stars · MIT

Text preprocessing, representation and visualization from zero to hero.

gpt-neox

by EleutherAI · Python

2012 stars · Apache-2.0

An implementation of model parallel autoregressive transformers on GPUs, based on the DeepSpeed library.

CLUEDatasetSearch

by CLUEbenchmark · Python

1760 stars

Search all Chinese NLP datasets, with commonly used English NLP datasets included.

The-NLP-Pandect

by ivan-bilan · Python

1418 stars · CC0-1.0

A comprehensive reference for all topics related to Natural Language Processing

longformer

by allenai · Python

1207 stars · Apache-2.0

Longformer: The Long-Document Transformer

rebiber

by yuchenlin · Python

1200 stars · MIT

A simple tool to update bib entries with their official information (e.g., DBLP or the ACL anthology).

spago

by nlpodyssey · Go

1116 stars · BSD-2-Clause

Self-contained Machine Learning and Natural Language Processing library in Go

Top Authors in Natural Language Processing

  1. allenai: 43 Libraries, 21327 stars
  2. IBM: 40 Libraries, 1095 stars
  3. microsoft: 40 Libraries, 13881 stars
  4. StarlangSoftware: 37 Libraries, 203 stars
  5. googlearchive: 37 Libraries, 6153 stars
  6. thunlp: 35 Libraries, 9769 stars
  7. facebookresearch: 31 Libraries, 41620 stars
  8. UKPLab: 28 Libraries, 8695 stars
  9. linuxscout: 24 Libraries, 739 stars
  10. undertheseanlp: 22 Libraries, 1302 stars


Trending Kits in Natural Language Processing

One of the most popular ways to visualize numerical data in pandas is the boxplot, which is created by calculating the quartiles of a data set. Box plots are among the most commonly used types of graphs in business, statistics, and data analysis.


One way to plot a boxplot from a pandas DataFrame is to use the boxplot() method that is part of the pandas library. Boxplots are also used to discover outliers in a data set. Pandas is a Python library built to streamline acquiring and manipulating relational data, with built-in methods for plotting and visualizing the values captured in its data structures. The plot() function draws points in a diagram and, by default, draws a line from point to point; it accepts parameters that control how each point is rendered.


Box plots are mostly used to show distributions of numeric data values, especially when you want to compare them between multiple groups. These plots are also broadly used for comparing two data sets.


Here is an example of how we can create a boxplot of a grouped column.

Preview of the output that you will get on running this code from your IDE

Code

In this solution, we use the boxplot functionality of pandas in Python.
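Below is a minimal sketch of the kind of snippet this kit describes; the DataFrame, column names, and random values are illustrative assumptions, not the kit's exact code.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy DataFrame with a grouping column and two numeric columns
df = pd.DataFrame({'Group': [1, 1, 1, 2, 3, 2, 2, 3, 1, 3],
                   'M': np.random.rand(10),
                   'F': np.random.rand(10)})

# One boxplot panel per numeric column, grouped by the "Group" column
df.boxplot(column=['M', 'F'], by='Group')
plt.show()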

  1. Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
  2. Create your own DataFrame that needs to be box-plotted.
  3. Add the NumPy library.
  4. Run the file to get the output.
  5. Add plt.show() at the end of the code to display the output.


I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.


I found this code snippet by searching for "Plotting boxplots for a groupby object" in kandi. You can try any such use case!


Note


  • In line 3, make sure the import statement starts with a lowercase i.
  • Create your own DataFrame, for example:

df = pd.DataFrame({'Group':[1,1,1,2,3,2,2,3,1,3],'M':np.random.rand(10),'F':np.random.rand(10)})

df = df[['Group','M','F']]

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.


  1. The solution is created in Python 3.7.15 Version
  2. The solution is tested on NumPy 1.21.6 Version
  3. The solution is tested on matplotlib 3.5.3 Version
  4. The solution is tested on Seaborn 0.12.2 Version


Using this solution, we are able to create a boxplot of a grouped column in Python with the help of the pandas library. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us create boxplots in Python.

Dependent Library

If you do not have pandas, matplotlib, seaborn, and NumPy, which are required to run this code, you can install them by clicking on the above link and copying the pip install command from the respective library page in kandi. You can search for any dependent library on kandi, like NumPy, pandas, matplotlib, and seaborn.

Support

  1. For any support on kandi solution kits, please use the chat
  2. For further learning resources, visit the Open Weaver Community learning page.


The popular Python package spaCy is used for natural language processing. Removing named entities, such as people's names, from a text is one of the things you can do with spaCy. This can be done using the ents property of a spaCy Doc, which returns the named entities that have been identified in the text.


Removing names using Spacy can have several applications, including:  

  • Anonymizing text data: Removing names from text data can be useful for protecting the privacy of individuals mentioned in the data.  
  • Text summarization: Removing named entities, such as names of people, organizations, and locations, can help to reduce the amount of irrelevant information in a text and improve the overall readability of the summary.  
  • Text classification: Removing named entities can help to improve the performance of text classification models by reducing the amount of noise in the text and making it easier for the model to identify the relevant features.  
  • Sentiment Analysis: Removing names can help to improve the accuracy of sentiment analysis by reducing the amount of personal bias and opinion present in the text.  


Here is how you can remove names using Spacy:  

Preview of the output that you will get on running this code from your IDE

Code

In this solution we used spacy library of python.
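Here is a minimal sketch of one way to do this; the sample sentence is an illustrative assumption, and the kit's actual snippet may differ.

import spacy

# The Note below uses en_core_web_trf; en_core_web_sm also works for a quick test
nlp = spacy.load("en_core_web_trf")

text = "Alice met Bob Smith in Paris last week."
doc = nlp(text)

# Rebuild the text, skipping tokens that belong to PERSON entities
cleaned = "".join(token.text_with_ws for token in doc if token.ent_type_ != "PERSON")
print(cleaned)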

  1. Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
  2. Enter the Text
  3. Run the file to remove the names from the text


I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.


I found this code snippet by searching for "Using Spacy to remove names from a data frame "in kandi. You can try any such use case.


Note


In this snippet we are using a Language Model (en_core_web_trf)

  1. Download the model using the command python -m spacy download en_core_web_trf.
  2. Paste it in your terminal to download it

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.


  1. The solution is created in Python 3.7.15 Version
  2. The solution is tested on Spacy 3.4.3 Version
  3. the solution is tested on Pandas 1.3.5 Version


Using this solution, we can remove names from text with the help of spaCy. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us remove names from text in Python.

Dependent Library

If you do not have SpaCy and pandas, which are required to run this code, you can install them by clicking on the above link and copying the pip install command from the respective library page in kandi.

You can search for any dependent library on kandi like Spacy and pandas

Support

  1. For any support on kandi solution kits, please use the chat
  2. For further learning resources, visit the Open Weaver Community learning page

We will locate a specific group of words in a text using the SpaCy library, then replace those words with an empty string to remove them from the text.  


Using SpaCy, it is possible to exclude words within a specific span from a text in the following ways:  

  • Text pre-processing: Removing specific words or phrases from text can be a useful step in pre-processing text data for NLP tasks such as text classification, sentiment analysis, and language translation.  
  • Document summarization: Keeping only the most crucial information by removing less important words or phrases helps construct a summary of a lengthy text.  
  • Data cleaning: Anonymization and data cleaning can both benefit from removing sensitive or useless text information, such as names and addresses.  
  • Text generation: Deleting specific words or phrases before generating new text can help steer the context or meaning of the generated content.  
  • Text augmentation: Removing specific words or phrases and replacing them with new variations is a common text-augmentation technique in NLP.  


Here is how you can remove words in span using SpaCy:  

Preview of the output that you will get on running this code from your IDE

Code

In this solution we have used spacy library of python
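As an illustration, here is a minimal sketch using spaCy's PhraseMatcher to locate a phrase and cut its span out of the text; the phrase and sample sentence are assumptions for demonstration.

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("REMOVE", [nlp.make_doc("natural language processing")])

text = "I love natural language processing and machine learning."
doc = nlp(text)

# Remove each matched span from the raw text, working right to left
cleaned = text
for _, start, end in sorted(matcher(doc), key=lambda m: m[1], reverse=True):
    span = doc[start:end]
    cleaned = cleaned[:span.start_char] + cleaned[span.end_char:]
print(" ".join(cleaned.split()))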

  1. Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
  2. Enter the Text
  3. Run the code to remove the specific words from the text


I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.


I found this code snippet by searching for "Remove words in span from spacy" in kandi. You can try any such use case!


Note


In this snippet we are using a Language model (en_core_web_sm)

  1. Download the model using the command python -m spacy download en_core_web_sm.
  2. Paste it in your terminal to download it.


Check your spaCy version using the pip show spacy command in your terminal.

  1. If the version is 3.0 or later, load the model using nlp = spacy.load("en_core_web_sm")
  2. If the version is earlier than 3.0, load it using nlp = spacy.load("en")

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.


  1. The solution is created in Python 3.7.15 Version
  2. The solution is tested on Spacy 3.4.3 Version


Using this solution, we can remove words within a specific span with the help of spaCy. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us remove unwanted words from a sentence in Python.

Dependent Library

If you do not have SpaCy and numpy, which are required to run this code, you can install them by clicking on the above link and copying the pip install command from the respective library page in kandi.

You can search for any dependent library on kandi like SpaCy and numpy

Support

  1. For any support on kandi solution kits, please use the chat
  2. For further learning resources, visit the Open Weaver Community learning page

Here are some of the famous C++ Natural Language Libraries. Some of the use cases of C++ Natural Language Libraries include Text Processing, Speech Recognition, Machine Translation, and Natural Language Understanding.


C++ natural language libraries are software libraries written in the C++ programming language that are used to process natural language, such as English, and extract meaning from text. These libraries are often used for natural language processing (NLP) applications, like text classification, sentiment analysis, and machine translation.


Let us have a look at some of the famous C++ Natural Language Libraries in detail below.

MITIE

  • Designed to be highly scalable, allowing it to process large amounts of text quickly and efficiently.
  • Uses a combination of statistical and machine learning techniques to identify relationships between words, phrases, and sentences.
  • Written in C++, making it easy to integrate with existing applications and systems.

Gate

  • Only library of its kind that offers multi-platform support for Windows, Mac, and Linux.
  • Allows developers to annotate text with semantic information, enabling more powerful natural language processing applications.
  • Only library of its kind that uses Java, making it more easily accessible to developers with existing Java skills.

spacy-cpp

  • One of the fastest C++ natural language libraries, offering up to 30x faster performance than similar libraries.
  • Includes features like tokenization, part-of-speech tagging, dependency parsing, and rule-based matching.
  • Designed to scale well for large datasets, making it a good choice for enterprise-level applications.

snowball

  • Offers a wide range of functions for stemming, lemmatization, and other natural language processing tasks.
  • Able to handle most Unicode characters and works across different platforms.
  • Offers powerful stemmers for multiple languages, including English, Spanish, French, German, Portuguese, and Italian.

aiml

  • Flexible platform that supports a wide range of use cases.
  • Designed to represent natural language.
  • Powerful library that can be used to create complex conversations and interactions with users.

polyglot

  • Designed for scalability, allowing developers to deploy applications on a distributed computing cluster.
  • Offers a range of tools that make it easier to develop and deploy natural language processing applications.
  • Designed to be highly portable, allowing developers to write code that can run on any platform and operating system.

NLTK

  • Open-source, so it is available to anyone and can be modified to fit specific needs.
  • Written in Python, making it more accessible and easier to use than other C++ natural language libraries.
  • Has a graphical user interface, which makes it easy to explore the data and develop models.

wordnet

  • Organized into semantic categories and hierarchical structures, allowing users to quickly find related words and their definitions.
  • Provides access to synonyms and antonyms, making it unique from other C++ natural language libraries.
  • Provides access to a corpus of example sentences and usage notes.

Tokenization is the division of a text string into discrete tokens. spaCy offers the option to personalize tokenization by building a custom tokenizer.


There are several uses for customizing tokens in SpaCy, some of which include: 

  • Handling special input forms: A custom tokenizer can be used to handle specific input formats, such as those seen in emails or tweets, and tokenize the text in accordance. 
  • Enhancing model performance: Custom tokenization can help your model perform better by giving it access to more pertinent and instructive tokens. 
  • Managing non-standard text: Some text inputs may contain non-standard words or characters, which require special handling. 
  • Handling multi-language inputs: A custom tokenizer can be used to handle text inputs in multiple languages by using language-specific tokenization methods. 
  • Using customized tokenization in a particular field: Text can be tokenized appropriately by using customized tokenization in a particular field, such as the legal, medical, or scientific fields. 


Here is how you can customize tokens in SpaCy: 

Preview of the output that will get on running this code from your IDE

Code

In this solution we have used matcher function of Spacy library.
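The kit's snippet reportedly uses the Matcher; as a simpler, hedged illustration of customizing tokenization itself, here is a sketch that adds a tokenizer special case (the example word is an assumption).

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")

# Special-case rule: always split "gimme" into two tokens
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

doc = nlp("gimme that book")
print([token.text for token in doc])  # ['gim', 'me', 'that', 'book']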

  1. Copy this code using "Copy" button above and paste it in your Python file IDE
  2. Enter the text that needed to be Tokenized
  3. Run the program to tokenize the given text.


I hope you found this useful. I have added the dependent libraries, versions, and related information in the following sections.


I found this code snippet by searching "Customize Tokens using spacy" in Kandi. You can try any use case

Environment Tested

I tested this solution in the following version. Be mindful of changes when working with other versions


  1. This solution is created and executed in Python 3.7.15 version
  2. This solution is tested in Spacy on 3.4.3 version


Using this solution, we can tokenize the text, which means it will break the text down into the analytical units needed for further processing. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us break up text in Python.

Dependent Libraries

If you do not have SpaCy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Spacy page in kandi. You can search for any dependent library on kandi like Spacy

Support


  1. For any support on kandi solution kits, please use the chat
  2. For further learning resources, visit the Open Weaver Community learning page.

Using SpaCy, you may utilize the techniques below to identify the full sentence that includes a particular keyword:  

  • Load the desired language model and import the SpaCy library.  
  • Process the text data by feeding it through the SpaCy nlp object.  
  • Iterate over the sentences in the processed text and check whether each one contains the keyword.  

Finding entire sentences that contain a particular term has a variety of applications, including:  

  • Text mining: This technique can be used to extract pertinent facts from massive amounts of text data by looking for sentences that include a particular keyword.  
  • Information retrieval: Users can quickly locate pertinent information in a document or group of documents by searching for sentences that contain a particular keyword.  
  • Question-answering: Finding sentences that answer a question can help question-answering systems be more accurate.  
  • Text summarization: Finding sentences with essential words in them can aid in creating a summary of a text that accurately conveys its primary concepts.  
  • Evaluation of the language model: The ability of the language model to produce writing that is human-like can be assessed by locating full sentences that contain a keyword.  


Here is how you can find the complete sentence that contains your keyword:  

Preview of the output that you will get on running this code from your IDE

Code

In this solution we have used Matcher function of SpaCy Library.
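A minimal sketch of the idea, iterating over doc.sents and keeping the sentences that contain the keyword; the keyword and text are illustrative assumptions.

import spacy

nlp = spacy.load("en_core_web_sm")
keyword = "climate"

text = "The report covers many topics. Climate change is accelerating. Funding was approved."
doc = nlp(text)

# Keep every sentence that contains the keyword, case-insensitively
matches = [sent.text for sent in doc.sents if any(token.lower_ == keyword for token in sent)]
print(matches)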

  1. Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
  2. Enter the Text
  3. Run the code to find the complete sentence you are looking for.


I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.


I found this code snippet by searching for "How to extract sentence with key phrases in SpaCy" in kandi. You can try any such use case!


Note


In this snippet we are using a Language model (en_core_web_sm)

  1. Download the model using the command python -m spacy download en_core_web_sm.
  2. Paste it in your terminal to download it.


Check your spaCy version using the pip show spacy command in your terminal.

  1. If the version is 3.0 or later, load the model using nlp = spacy.load("en_core_web_sm")
  2. If the version is earlier than 3.0, load it using nlp = spacy.load("en")


Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.


  1. The solution is created in Python 3.7.15 Version
  2. The solution is tested on Spacy 3.4.3 Version


Using this solution, we can collect the complete sentence that the user needs with the help of spaCy. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us collect the sentences or keywords the user needs in Python.

Dependent Library

If you do not have SpaCy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Spacy page in kandi.

You can search for any dependent library on kandi like SpaCy

Support

  1. For any support on kandi solution kits, please use the chat
  2. For further learning resources, visit the Open Weaver Community learning page

In the spaCy library, a token refers to a single word or punctuation mark that is part of a larger document. Tokens have various attributes, such as the text of the token, its part-of-speech tag, and its dependency label, that can be used to extract information from the text and understand its meaning. 


In spaCy 3.x, tokens can be merged into a single token using the retokenizer: inside a "with doc.retokenize() as retokenizer:" block, call retokenizer.merge() with the Span of tokens you want to combine. (Older spaCy 1.x/2.x releases exposed this as the now-removed "Doc.merge()" method.) 

  • retokenizer.merge(): This combines multiple individual tokens into a single token, which can be useful for various natural language processing tasks. 


Merging spaCy tokens into a Doc allows you to group multiple individual tokens into a single token, which can be useful for various natural language processing tasks. 


 You may have a look at the code below for more information about merging SpaCy tokens into a doc. 

Preview of the output that you will get on running this code from your IDE

Code

In this solution we have used spaCy - Retokenizer.merge Method from SpaCy.
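Here is a minimal sketch of the retokenizer-based merge in spaCy 3.x; the sample phrase is an assumption.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("New York is a big city.")
print([token.text for token in doc])   # ['New', 'York', 'is', ...]

# Merge the first two tokens ("New York") into a single token
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])

print([token.text for token in doc])   # ['New York', 'is', ...]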

  1. Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
  2. Enter the Text
  3. Run the file to merge the tokens in the doc


I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.


I found this code snippet by searching for "Merge sapcy tokens into a Doc " in kandi. You can try any such use case!

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.


  1. The solution is created in Python 3.7.15 Version
  2. The solution is tested on Spacy 3.4.3 Version


Using this solution, we can merge the tokens into the doc with the help of spaCy. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us merge tokens in Python.

Dependent Library

If you do not have SpaCy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Spacy page in kandi.

You can search for any dependent library on kandi like Spacy

Support

  1. For any support on kandi solution kits, please use the chat
  2. For further learning resources, visit the Open Weaver Community learning page

Hyphenated words have a hyphen (-) between two or more parts of the word. Hyphens are often used to join commonly used words into compounds.   


Tokenization is breaking down a piece of text into smaller units called tokens. Tokens are the basic building blocks of a text, and they can be words, phrases, sentences, or even individual characters, depending on the task and the granularity level required. The tokenization of hyphenated words can be tricky, as the hyphen can indicate different things depending on the context and the language. There are various ways to handle hyphenated words during tokenization, and the best method will depend on the specific task and the desired level of granularity.  

  • Treat the entire word as a single token: It treats the entire word, including the hyphen, as a single token.  
  • Treat the word as two separate tokens: This method splits the word into two separate tokens, one for each part of the word.  
  • Treat the hyphen as a separate token: This method treats the hyphen as a separate token.  

 

You may have a look at the code below for more information about Tokenization of hyphenated words.

Preview of the output that you will get on running this code from your IDE

Code

In this solution we have used Tokenizer function of NLTK.
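A minimal sketch using NLTK's tokenizers, contrasting the Treebank-style tokenizer (which keeps hyphenated compounds together) with a punctuation-based tokenizer (which splits them); the sample sentence is an assumption.

import nltk
from nltk.tokenize import word_tokenize, wordpunct_tokenize

nltk.download("punkt", quiet=True)  # tokenizer data used by word_tokenize (newer NLTK may also need "punkt_tab")

text = "The state-of-the-art model uses a well-known tokenizer."

# Treebank-style tokenization keeps hyphenated compounds as single tokens
print(word_tokenize(text))

# wordpunct_tokenize splits on punctuation, so hyphenated words become separate tokens
print(wordpunct_tokenize(text))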

  1. Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
  2. Enter the Text
  3. Run the file to tokenize the hyphenated words


I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.


I found this code snippet by searching for "Tokenization of Hyphenated Words" in kandi. You can try any such use case!


Note


In this snippet we are using a Language model (en_core_web_sm)

  1. Download the model using the command python -m spacy download en_core_web_sm.
  2. Paste it in your terminal to download it.


Check your spaCy version using the pip show spacy command in your terminal.

  1. If the version is 3.0 or later, load the model using nlp = spacy.load("en_core_web_sm")
  2. If the version is earlier than 3.0, load it using nlp = spacy.load("en")


Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.


  1. The solution is created in Python 3.7.15 version.
  2. The solution is tested on Spacy 3.4.3 version.


Using this solution, we are able to tokenize hyphenated words in Python with simple steps. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us tokenize words in Python.

Dependent Library

If you do not have SpaCy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Spacy page in kandi.

You can search for any dependent library on kandi like Spacy

Support

  1. For any support on kandi solution kits, please use the chat
  2. For further learning resources, visit the Open Weaver Community learning page

To remove names from noun chunks, you can use the SpaCy library in Python. First load the library and a pre-trained model, then use the doc.noun_chunks attribute to extract all of the noun chunks in a given text. You can then loop through each chunk and use an if statement to check whether the chunk contains a proper noun; if it does, you can remove it from the text.  


The removal of names from noun chunks has a variety of uses, such as:  

  • Data anonymization: To respect people's privacy and adhere to data protection laws, text can be made anonymous by removing personal identifiers.  
  • Text summarization: By omitting proper names from the text, it is possible to condense the length of a summary while maintaining the important points.  
  • Text classification: By lowering the amount of noise in the input data, removing proper names from text improves text classification algorithms' performance.  
  • Sentiment analysis: By removing proper names, sentiment analysis can be made more objective.  
  • Text-to-Speech: By removing proper names from the discourse, text-to-speech output can sound more natural.  


Here is how you can remove names from noun chunks:  

Preview of the output that you will get on running this code from your IDE

Code

In this solution we have used Spacy library of python.
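A minimal sketch of the idea, dropping proper-noun tokens from each noun chunk; the sample sentence is an assumption and the kit's snippet may differ.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("My friend Sarah Johnson visited the Eiffel Tower with her brother.")

for chunk in doc.noun_chunks:
    # Keep only the tokens of the chunk that are not proper nouns (names)
    kept = [token.text for token in chunk if token.pos_ != "PROPN"]
    print(chunk.text, "->", " ".join(kept))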

Instructions:


  1. Download and install VS Code on your desktop.
  2. Open VS Code and create a new file in the editor.
  3. Copy the code snippet that you want to run, using the "Copy" button or by selecting the text and using the copy command (Ctrl+C on Windows/Linux or Cmd+C on Mac).
  4. Paste the code into your file in VS Code, and save the file with a meaningful name.
  5. Open a terminal window or command prompt on your computer.
  6. To download spaCy, use this command: pip install spacy==3.4.3
  7. Once spacy is installed, you can download the en_core_web_sm model using the following command: python -m spacy download en_core_web_sm Alternatively, you can install the model directly using pip: pip install en_core_web_sm
  8. To run the code, open the file in VS Code and click the "Run" button in the top menu, or use the keyboard shortcut Ctrl+Alt+N (on Windows and Linux) or Cmd+Alt+N (on Mac). The output of your code will appear in the VS Code output console.


I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.


I found this code snippet by searching for "Remove Name from noun chuncks using SpaCy" in kandi. You can try any such use case!

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.


  1. The solution is created in Python 3.7.15 Version
  2. The solution is tested on Spacy 3.4.3 Version


Using this solution, we can remove names from noun chunks with the help of spaCy. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us extract the remaining nouns in Python.

Dependent Library

If you do not have SpaCy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Spacy page in kandi.

You can search for any dependent library on kandi like SpaCy

Support

  1. For any support on kandi solution kits, please use the chat
  2. For further learning resources, visit the Open Weaver Community learning page

An attribute error occurs in Python when a code tries to access an attribute (i.e., a variable or method) that does not exist in an object or class. For example, if you try to access an instance variable that has not been defined, you will get an attribute error.  


When using spaCy, an attribute error can happen if you try to access a property or attribute of an object (such as a token or doc) that is not declared or doesn't exist. To fix this, check whether the attribute is present before attempting to access it, for example with the built-in "hasattr()" function, or consult the spaCy documentation to confirm that the attribute exists for that object.  

  • hasattr(): hasattr() is a built-in Python function that is used to check if an object has a given attribute. It takes two arguments: the object to check and the attribute’s name as a string. If the object has the attribute, hasattr() returns True. Otherwise, it returns False.  

It is important to read the spaCy documentation to understand the properties and methods provided by spaCy for different objects.  

 

You may have a look at the code below for more information about solving attribute errors using SpaCy.  

Preview of the output that you will get on running this code from your IDE

Code

In this solution we have used Spacy library.
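A minimal sketch of guarding attribute access with hasattr() on a spaCy token; the attribute name and text are illustrative.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apples are tasty.")
token = doc[0]

# Check that the attribute exists before using it, to avoid an AttributeError
if hasattr(token, "lemma_"):
    print(token.lemma_)
else:
    print("This token object has no 'lemma_' attribute")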

  1. Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
  2. Run the program to get the text to lemmatize


I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.


I found this code snippet by searching for "How can i solve an attribute error when using SpaCy " in kandi. You can try any such use case!

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.


  1. The solution is created in Python 3.7.15 Version
  2. The solution is tested on Spacy 3.4.3 Version


Using this solution, we are going to lemmatize the words with the help of the spaCy library. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us lemmatize words in Python.

Dependent Library

If you do not have SpaCy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Spacy page in kandi.

You can search for any dependent library on kandi like Spacy

Support

  • For any support on kandi solution kits, please use the chat
  • For further learning resources, visit the Open Weaver Community learning page.

Entities are specific pieces of information that can be extracted from a text. They can be categorized into different types: person, location, organization, event, product etc. These are some common entity types, but other entities may depend on the specific use case or domain.  


Tagging entities in a string, also known as named-entity recognition (NER), is a way to extract structured information from unstructured text. Tagging entities involves identifying and classifying specific pieces of information, such as people, places, and organizations, and labeling them with specific tags or labels. There are several ways to tag entities in a string, some of which include:  

  • Regular expressions: This method uses pattern matching to identify entities in a string.  
  • Named Entity Recognition (NER): This method uses machine learning algorithms to identify entities in a string. It is commonly used in natural language processing tasks.  
  • Dictionary or lookup-based method: This method uses a pre-defined dictionary or lookup table to match entities in a string.  

 

You may have a look at the code below for more information about tagging entities in string.

Preview of the output that you will get on running this code from your IDE

Code

In this solution we have used Spacy library.
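A minimal sketch of tagging entities with spaCy's built-in NER; the sample sentence is an assumption.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple was founded by Steve Jobs in California in 1976.")

# Each recognized entity carries its text and a label such as PERSON, ORG or GPE
for ent in doc.ents:
    print(ent.text, ent.label_)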

  1. Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
  2. Enter the Text
  3. Run the file to Tag the entities in the string


I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.


I found this code snippet by searching for "Tag entities in the string using Spacy " in kandi. You can try any such use case!


Note


In this snippet we are using a Language model (en_core_web_sm)

  1. Download the model using the command python -m spacy download en_core_web_sm.
  2. Paste it in your terminal to download it.


Check your spaCy version using the pip show spacy command in your terminal.

  1. If the version is 3.0 or later, load the model using nlp = spacy.load("en_core_web_sm")
  2. If the version is earlier than 3.0, load it using nlp = spacy.load("en")


Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.


  1. The solution is created in Python 3.7.15 Version
  2. The solution is tested on Spacy 3.4.3 Version


Using this solution, we can tag the entities in a string with the help of spaCy. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us tag entities in Python.

Dependent Library

If you do not have SpaCy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Spacy page in kandi.

You can search for any dependent library on kandi like Spacy

Support

  1. For any support on kandi solution kits, please use the chat
  2. For further learning resources, visit the Open Weaver Community learning page

The spaCy library provides the Doc object to represent a document, which can be tokenized into individual words or phrases (tokens) using the “doc.sents” and doc[i] attributes. You can convert a Doc object into a nested list of tokens by iterating through the sentences in the document, and then iterating through the tokens in each sentence. 

  • spaCy: spaCy is a library for advanced natural language processing in Python. It is designed specifically for production use, and it is fast and efficient. spaCy is widely used for natural language processing tasks such as named entity recognition, part-of-speech tagging, text classification, and others. 
  • Doc.sents: Doc.sents allows you to work with individual sentences easily and efficiently in a text, rather than having to manually split the text into sentences yourself. This can be useful in a variety of natural languages processing tasks, such as sentiment analysis or text summarization, where it's important to be able to work with individual sentences. 


To learn more about the topic, you may have a look at the code below

Preview of the output that you will get on running this code from your IDE

Code

In this solution we used spaCy library of python.
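A minimal sketch that builds one inner list of token texts per sentence; the sample text is an assumption.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("spaCy is fast. It is also easy to use.")

# Build a nested list: one inner list of token texts per sentence
nested = [[token.text for token in sent] for sent in doc.sents]
print(nested)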

  1. Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
  2. Import the spaCy library
  3. Run the file to turn spacy doc into nested List of tokens.


I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.


I found this code snippet by searching for "How to turn spacy doc into nested list of tokens"in kandi. You can try any such use case.

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.


  1. The solution is created in Python 3.7.15 Version
  2. The solution is tested on Spacy 3.4.3 Version


Using this solution, we can turn a spaCy doc into a nested list of tokens with the help of spaCy. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us turn a doc into a nested list in Python.

Dependent Library

If you do not have SpaCy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Spacy page in kandi.

You can search for any dependent library on kandi like Spacy

Support

  1. For any support on kandi solution kits, please use the chat
  2. For further learning resources, visit the Open Weaver Community learning page

"Matching a pattern" refers to finding occurrences of a certain word pattern or other linguistic components inside a text. A regular expression, which is a string of characters that forms a search pattern, is frequently used for this. 


In Python, you can use the “re” module to match patterns in strings using regular expressions. There are several functions and techniques for locating and modifying patterns in strings available in the “re” module. 

  • re.search(): This is used to search for a specific pattern in a text. It is a component of the Python regular expression (re) module, which offers a collection of tools for using regular expressions. 

A document's token patterns may be matched using the “Matcher” class in spaCy. 

  • Matcher: It returns the spans in the document that match a set of token patterns that are input. 
  • spaCy: With the help of this well-known Python module for natural language processing, users may interact with text data in a rapid and easy manner. It contains ready-to-use pre-trained models for a variety of languages. It is often used for a range of NLP applications in both industry and academics. 


You can have a look at the code below to match the pattern using SpaCy.

Preview of the output that you will get on running this code from your IDE

Code

In this solution we use the Matcher function of the SpaCy library.
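A minimal sketch of the spaCy 3.x Matcher API; the pattern and text are illustrative assumptions, and note that patterns are passed inside a list, as the Note below points out.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Pattern: "hello", optional punctuation, then "world"
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])  # spaCy 3.x expects a list of patterns

doc = nlp("Hello, world! Hello world!")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)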

  1. Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
  2. Enter the text that needs to be matched
  3. Run the file to find matches for our pattern.


I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.


I found this code snippet by searching for "Pattern in Spacy " in kandi. You can try any such use case!


Note


  1. In this solution the function takes only two arguments, so please delete the None argument in Line 11.
  2. Newer versions of Spacy require the pattern to be wrapped in a list, so enclose the pattern in square brackets in Line 11.

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.


  1. The solution is created in Python 3.7.15 Version
  2. The solution is tested on Spacy 3.4.3 Version


Using this solution, we are able to find matches for our pattern in Python with the help of the spaCy library. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us find matches to our pattern in Python.

Dependent Library

If you do not have SpaCy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Spacy page in kandi.

You can search for any dependent library on kandi like Spacy

Support

  1. For any support on kandi solution kits, please use the chat
  2. For further learning resources, visit the Open Weaver Community learning page.

In spaCy, the Matcher class allows you to match patterns of tokens in a Doc object. The Matcher is initialized with a vocabulary, and then patterns can be added to the matcher using the “Matcher.add()” method. The patterns that are added to the matcher are defined using a list of dictionaries, where each dictionary represents a token and its attributes. 

  • Matcher.add(): In spaCy, the Matcher.add() method is used to add patterns to a Matcher object, which can then be used to find matches in a Doc object. 

Once the patterns have been added to the matcher, you can call the matcher on a Doc object (matcher(doc)) to find all instances of the specified patterns. 

  • matcher(doc): The call returns a list of tuples, where each tuple represents a match and contains the match ID and the start and end index of the matching span in the Doc. This can be useful in various NLP tasks such as information extraction, text summarization, and sentiment analysis. 

 

You may have a look at the code below for more information about SpaCy matcher patterns with specific nouns. 

Preview of the output that you will get on running this code from your IDE

Code

In this solution we have used Matcher function of Spacy.
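A minimal sketch of a Matcher pattern that picks out nouns ending in s, l, or t; the sample sentence is an assumption.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Match any noun whose text ends with "s", "l" or "t"
pattern = [{"POS": "NOUN", "TEXT": {"REGEX": "[slt]$"}}]
matcher.add("NOUNS_SLT", [pattern])

doc = nlp("The cat chased the rabbits across the hill at night.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)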

  1. Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
  2. Enter the Text
  3. Run the file to get the nouns that end with s, l, or t.


I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.


I found this code snippet by searching for "Spacy matcher pattern with specific Nouns" in kandi. You can try any such use case!

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.


  1. The solution is created in Python 3.7.15 Version
  2. The solution is tested on Spacy 3.4.3 Version


Using this solution, we can collect nouns that end with s, t, or l with the help of spaCy. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us collect whatever nouns the user needs in Python.

Dependent Library

If you do not have SpaCy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Spacy page in kandi.

You can search for any dependent library on kandi like Spacy

Support

  1. For any support on kandi solution kits, please use the chat
  2. For further learning resources, visit the Open Weaver Community learning page

In SpaCy, you can use the part-of-speech (POS) tagging functionality to classify words as nouns. POS tagging is the process of marking each word in a text with its corresponding POS tag. 


Numerous uses for noun classification using SpaCy include: 

  • Information Extraction: You can extract important details about the subjects discussed in a document, such as individuals, organizations, places, etc., by identifying the nouns in the text. 
  • Text summarization: You can extract the key subjects or entities discussed in a text and use them to summarize the text by selecting important nouns in the text. 
  • Text classification: You can categorize a text into different categories or themes by determining the most prevalent nouns in the text. 
  • Text generation: You can create new material that is coherent and semantically equivalent to the original text by identifying the nouns in a text and the relationships between them. 
  • Named Entity Recognition (NER): SpaCy provides built-in support for NER, which can be used to extract entities from text with high accuracy. 
  • Query Expansion 
  • Language Translation 


Here is how you can perform noun classification using SpaCy:

Preview of the output that you will get on running this code from your IDE

Code

In this solution we have used the Spacy library.
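A minimal sketch of collecting nouns via POS tags; the sample sentence is an assumption.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog near the river.")

# Coarse-grained POS tags: NOUN for common nouns, PROPN for proper nouns
nouns = [token.text for token in doc if token.pos_ in ("NOUN", "PROPN")]
print(nouns)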

Instructions:


  1. Download and install VS Code on your desktop.
  2. Open VS Code and create a new file in the editor.
  3. Copy the code snippet that you want to run, using the "Copy" button or by selecting the text and using the copy command (Ctrl+C on Windows/Linux or Cmd+C on Mac).
  4. Paste the code into your file in VS Code, and save the file with a meaningful name.
  5. Open a terminal window or command prompt on your computer.
  6. To download spaCy, use this command: pip install spacy==3.4.3
  7. Once spacy is installed, you can download the en_core_web_sm model using the following command: python -m spacy download en_core_web_sm Alternatively, you can install the model directly using pip: pip install en_core_web_sm
  8. To run the code, open the file in VS Code and click the "Run" button in the top menu, or use the keyboard shortcut Ctrl+Alt+N (on Windows and Linux) or Cmd+Alt+N (on Mac). The output of your code will appear in the VS Code output console.


I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.


I found this code snippet by searching for "Noun Classification using Spacy " in kandi. You can try any such use case!

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.


  1. The solution is created in Python 3.7.15 version
  2. The solution is tested on Spacy 3.4.3 version


Using this solution, we are able to collect the nouns separately from the text with the help of the spaCy library in Python. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us collect nouns using Python.

Dependent Library

If you do not have SpaCy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Spacy page in kandi. You can search for any dependent library on kandi like Spacy

Support

  1. For any support on kandi solution kits, please use the chat
  2. For further learning resources, visit the Open Weaver Community learning page.

The Dependency Matcher, a powerful tool offered by the SpaCy library, can be used to match particular phrases based on the dependency parse of a sentence. Instead of matching word sequences by their surface forms alone, the Dependency Matcher lets you match patterns over the grammatical relationships between words.  


The SpaCy Dependency Matcher can be used in a variety of ways to match particular phrases based on their dependencies, such as:  

  • Text categorization: You can extract particular phrases from text using the Dependency Matcher.  
  • Information extraction: The Dependency Matcher can be used to extract specific data from language, including attributes, costs, and features of goods.  
  • Question answering: The Dependency Matcher can be used to identify the subject, verb, and object in a sentence to improve the accuracy of question answering systems.  
  • Text generation: By matching particular phrases based on their dependencies, the Dependency Matcher may produce text that is grammatically accurate and semantically relevant.  
  • Text summarization: The Dependency Matcher can be used to identify key phrases that capture the text's essential concepts and serve as a summary.  

  

Here is how you can use the spaCy Dependency Matcher with regex phrase patterns in Python:

Preview of the output that you will get on running this code from your IDE

Code

In this solution we have used Pandas Library.
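The kit's snippet reportedly also uses pandas; as a hedged illustration of the core idea, here is a minimal DependencyMatcher sketch that anchors on a verb matched by a regex and requires a nominal subject (the verb regex and sentence are assumptions).

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

# Anchor token: a verb whose text matches the regex; child: its nominal subject
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB", "TEXT": {"REGEX": "(buy|bought)"}}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "subject", "RIGHT_ATTRS": {"DEP": "nsubj"}},
]
matcher.add("BUY_WITH_SUBJECT", [pattern])

doc = nlp("Alice bought a new laptop yesterday.")
for match_id, token_ids in matcher(doc):
    print([doc[i].text for i in token_ids])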

  1. Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
  2. Enter the Text
  3. Run the code to get the Output


I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.


I found this code snippet by searching for "Spacy Regex Phrase using Dependency matcher" in kandi. You can try any such use case!

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.


  1. The solution is created in Python 3.7.15
  2. The solution is tested on spaCy 3.4.3


Using this solution, we can collect the sentences the user needs with the help of matcher functions in spaCy. This process also facilitates an easy-to-use, hassle-free way to create a hands-on working version of code that helps us extract the matching sentences in Python.

Dependent Library

If you do not have the SpaCy library that is required to run this code, you can install it by clicking on the above link and copying the pip install command from the SpaCy page in kandi.

You can search for any dependent library on kandi like SpaCy

Support

  1. For any support on kandi solution kits, please use the chat
  2. For further learning resources, visit the Open Weaver Community learning page

NLP helps build systems that can automatically analyze, process, summarize and extract meaning from natural language text. In a nutshell, NLP helps machines mimic human behaviour and allows us to build applications that can reason about different types of documents. NLP open-source libraries are tools that allow you to build your own NLP applications. These libraries can be used to develop many different types of applications, such as speech recognition, chatbots, sentiment analysis, email spam filtering, language translation, search engines, and question answering systems. NLTK is one of the most popular NLP libraries in Python. It provides easy-to-use interfaces to corpora and lexical resources such as WordNet, along with statistical models for common tasks such as part-of-speech tagging and noun phrase extraction. The list that follows also covers basic sentiment analysis, including the VADER (Valence Aware Dictionary and sEntiment Reasoner) tool, and collections of NLP resources such as blogs, books, and tutorials. Check out the list of free, open-source libraries below to help you with your projects.
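As a taste of how approachable these libraries are, here is a minimal sketch of sentence splitting and part-of-speech tagging with NLTK (the sample text is an assumption; the tokenizer and tagger resources are downloaded on first use); the curated lists follow right after.

import nltk

# Download the tokenizer and tagger resources once
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

text = "NLTK makes text processing simple. It ships with many corpora and models."
sentences = nltk.sent_tokenize(text)       # sentence splitting
tokens = nltk.word_tokenize(sentences[0])  # word tokenization
print(nltk.pos_tag(tokens))                # part-of-speech tags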

Some Popular Open Source Libraries to get you started

Use the libraries below for tokenization, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition.

Sentiment Analysis Repositories
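To show what a basic sentiment analysis tool does in practice, here is a minimal sketch using NLTK's bundled VADER analyzer (the sample sentences are assumptions for demonstration):

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# The VADER lexicon ships separately and must be downloaded once
nltk.download("vader_lexicon")

analyzer = SentimentIntensityAnalyzer()
for sentence in ["I love this library!", "The documentation is confusing."]:
    scores = analyzer.polarity_scores(sentence)
    print(sentence, scores["compound"])  # compound score ranges from -1 (negative) to 1 (positive)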

Some interesting courses to Deep Dive

  1. List of Popular Courses on NLP
  2. Stanford Course on Natural Language Processing
  3. André Ribeiro Miranda - NLP Course

Recording from Session on Build AI fake News Detector

Watch recording of a live training session on AI Fake News Detection

Example project on AI Virtual Agent that you can build in 30 mins

Here's a project with the installer, source code, and step-by-step tutorial that you can build in under 30 minutes.

⬇️ Get the 1-Click install AI Virtual Agent kit

Watch the recording of a live training session on AI Virtual Agent

Natural Language Processing (NLP) is a broad subject that falls under the Artificial Intelligence (AI) domain. NLP allows computers to interpret text and spoken language in the same way that people do. NLP must be able to grasp not only words, but also phrases and paragraphs in their context based on syntax, grammar, and other factors. NLP algorithms break down human speech into machine-understandable fragments that can be utilized to create NLP-based software.

Because of the development of useful NLP libraries, NLP is now finding applications across a wide range of industries and has become a critical component of deep learning development. Extracting useful information from text is crucial for building chatbots and virtual assistants, among other NLP applications. Training NLP algorithms requires a large amount of data for good performance, which is why assistants such as Google Assistant and Alexa are becoming more natural by the day. Here are some basic libraries to get started with NLP.

NLTK: The Natural Language Toolkit is one of the most frequently used libraries in the industry for building Python applications that work with human language data. NLTK can assist with everything from splitting paragraphs into sentences to recognizing the part of speech of specific words to identifying the main theme of a text. It is a very useful tool for preparing text for later analysis, for example before feeding it to a model: it helps translate words into numbers that a model can work with. The collection contains nearly all the tools required for NLP and helps with text classification, tokenization, parsing, part-of-speech tagging and stemming.

spaCy: spaCy is a Python library built for sophisticated Natural Language Processing. It is based on cutting-edge research and was designed from the start to be used in real-world products. spaCy offers pre-trained pipelines and currently supports tokenization and training for more than 60 languages. It includes state-of-the-art speed and neural network models for tagging, parsing, named entity recognition, text classification, and other tasks, as well as a production-ready training system and simple model packaging, deployment, and workflow management.

Gensim: Gensim is a well-known Python package for natural language processing tasks. Its distinctive feature is the use of vector space modeling and topic modeling tools to determine the semantic similarity between two documents.
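For instance, here is a minimal sketch of Gensim's document-similarity workflow (the toy documents and query are assumptions for demonstration):

from gensim import corpora, models, similarities

docs = [["machine", "learning", "for", "text"],
        ["deep", "learning", "models"],
        ["cooking", "recipes", "and", "food"]]

dictionary = corpora.Dictionary(docs)                 # map tokens to ids
corpus = [dictionary.doc2bow(d) for d in docs]        # bag-of-words vectors
tfidf = models.TfidfModel(corpus)                     # weight terms by TF-IDF
index = similarities.MatrixSimilarity(tfidf[corpus])  # dense similarity index

query = dictionary.doc2bow(["learning", "text", "models"])
print(list(index[tfidf[query]]))  # cosine similarity of the query to each document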

CoreNLP: CoreNLP can be used to create linguistic annotations for text, such as token and sentence boundaries, parts of speech, named entities, numeric and temporal values, dependency and constituency parses, sentiment, quotation attributions, and relations between words. CoreNLP supports a variety of human languages, including Arabic, Chinese, English, French, German, and Spanish. It is written in Java but has support for Python as well.

Pattern: Pattern is a Python-based NLP library that provides features such as part-of-speech tagging, sentiment analysis, and vector space modeling. It offers support for the Twitter and Facebook APIs, a DOM parser, and a web crawler. Pattern is often used to convert HTML data to plain text and to resolve spelling mistakes in textual data.

Polyglot: The Polyglot library provides an impressive breadth of analysis and covers a wide range of languages. Polyglot's spaCy-like efficiency and ease of use make it an excellent choice for projects that need a language that spaCy does not support. The package provides a command-line interface as well as library access through pipeline methods.

TextBlob: TextBlob is a Python library that is often used for natural language processing tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and classification. It is built on top of NLTK. Its user-friendly interface provides access to common NLP tasks such as sentiment analysis, word extraction, parsing, and many more.

Flair: Flair is a deep learning library built on top of PyTorch for NLP tasks and supports a growing number of languages. With it you can apply the latest NLP models to your text, such as named entity recognition, part-of-speech tagging, classification, and word sense disambiguation. Flair natively provides pre-trained models for tasks such as text classification, part-of-speech tagging, and named entity recognition.
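A minimal TextBlob sketch of the tasks mentioned above (the sample sentence is an assumption; TextBlob's corpora must be downloaded once with python -m textblob.download_corpora):

from textblob import TextBlob

blob = TextBlob("The new release is fast and the API feels very intuitive.")

print(blob.noun_phrases)            # noun phrase extraction
print(blob.sentiment.polarity)      # sentiment polarity, from -1 to 1
print(blob.sentiment.subjectivity)  # subjectivity, from 0 to 1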

Paraphrasing refers to rewriting something in different words and with different expressions, without changing the overall concept or meaning. It is a method in which we use alternative words and different sentence structures; paraphrasing is a restatement of some content or text, and it is done with a sentence re-phraser (paraphraser). What makes a good paraphrase? Almost all conditioned text generation models are validated on two factors: (1) whether the generated text conveys the same meaning as the original context (adequacy), and (2) whether the text is fluent, grammatically correct English (fluency). For instance, Neural Machine Translation outputs are tested for adequacy and fluency. But a good paraphrase should be adequate and fluent while being as different as possible in its surface lexical form. With respect to this definition, the three key metrics that measure the quality of paraphrases are:

  • Adequacy (Is the meaning preserved adequately?)
  • Fluency (Is the paraphrase fluent English?)
  • Diversity (Lexical / Phrasal / Syntactical) (How much has the paraphrase changed the original sentence?)
The aim of a paraphraser is to create paraphrases that are fluent and have the same meaning. There are many uses or applications of a Paraphraser:
  • Data Augmentation: Paraphrasing helps in augmenting/creating training data for Natural Language Understanding(NLU) models to build robust models for conversational engines by creating equivalent paraphrases for a particular phrase or sentence thereby creating a text corpus as training data.
  • Summarization: Paraphrasing helps to create summaries of a large text corpus for understanding the crux of the text corpus.
  • Sentence Rephrasing: Paraphrasing helps in generating sentences with similar context for a particular phrase/sentence. These rephrased sentences can be used to create plagiarism free content for articles, blogs etc.
A typical process flow for creating training data by data augmentation with a paraphraser is illustrated below.
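A minimal sketch of the augmentation step, using a generic Hugging Face seq2seq model as the paraphraser; the checkpoint name and the "paraphrase:" prefix are placeholder assumptions, so substitute whatever paraphrase-capable model you actually use:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder checkpoint: swap in any paraphrase-capable seq2seq model you trust
MODEL_NAME = "your-paraphrase-model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

phrase = "How do I reset my password?"
# Many T5-style paraphrasers expect a task prefix; this is an assumption, not universal
inputs = tokenizer("paraphrase: " + phrase, return_tensors="pt")

# Sampling encourages lexical and syntactic diversity in the generated paraphrases
outputs = model.generate(**inputs, do_sample=True, top_p=0.95,
                         num_return_sequences=5, max_length=40)
paraphrases = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
print(paraphrases)  # candidate training utterances for the NLU model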


Troubleshooting

For Windows users: when you attempt to run the kit_installer batch file, you might see a prompt from Microsoft Defender.

Artificial intelligence (AI) makes it possible for machines to learn from experience, adjust to new inputs and perform human-like tasks. Most AI examples that you hear about today – from chess-playing computers to self-driving cars – rely heavily on deep learning and natural language processing.


Trending Discussions on Natural Language Processing

How can I convert this language to actual numbers and text?

Numpy: Get indices of boundaries in an array where starts of boundaries always start with a particular number; non-boundaries by a particular number

How to replace text between multiple tags based on character length

Remove duplicates from a tuple

ValueError: You must include at least one label and at least one sequence

RuntimeError: CUDA out of memory | Elastic Search

News article extract using requests,bs4 and newspaper packages. why doesn't links=soup.select(".r a") find anything?. This code was working earlier

Changing Stanza model directory for pyinstaller executable

Type-Token Ratio in Google Sheets: How to manipulate long strings of text (millions of characters)

a bug for tf.keras.layers.TextVectorization when built from saved configs and weights

QUESTION

How can I convert this language to actual numbers and text?

Asked 2022-Mar-06 at 13:11

I am working on a natural language processing project with deep learning, and I downloaded a word embedding file. The file is in .bin format. I can open that file with

file = open("cbow.bin", "rb")

But when I type

file = open("cbow.bin", "rb")
file.read(100)

I get

file = open("cbow.bin", "rb")
file.read(100)
b'4347907 300\n</s> H\xe1\xae:0\x16\xc1:\xbfX\xa7\xbaR8\x8f\xba\xa0\xd3\xee9K\xfe\x83::m\xa49\xbc\xbb\x938\xa4p\x9d\xbat\xdaA:UU\xbe\xba\x93_\xda9\x82N\x83\xb9\xaeG\xa7\xb9\xde\xdd\x90\xbaww$\xba\xfdba:\x14.\x84:R\xb8\x81:0\x96\x0b:\x96\xfc\x06'

What is this, and how can I convert it into actual numbers and text using Python?

ANSWER

Answered 2022-Mar-06 at 13:11

The "weird language" you are referring to is a Python bytestring.

As @jolitti implied, you won't be able to convert this particular bytestring to readable text.

If the bytestring contained any characters you recognize, they would have been displayed like this:

b'Guido van Rossum'
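The header "4347907 300" suggests the file follows the standard word2vec binary layout (vocabulary size followed by vector dimensionality). Under that assumption, a minimal sketch using gensim can turn it into actual words and numbers:

# A minimal sketch, assuming cbow.bin is a standard word2vec binary file (gensim 4.x API)
from gensim.models import KeyedVectors

# binary=True tells gensim to parse the word2vec binary layout
vectors = KeyedVectors.load_word2vec_format("cbow.bin", binary=True)

print(len(vectors.index_to_key))   # vocabulary size, e.g. 4347907
word = vectors.index_to_key[1]     # some word from the vocabulary
print(word, vectors[word][:5])     # the word and the first 5 dimensions of its vector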

Source https://stackoverflow.com/questions/71370347

QUESTION

Numpy: Get indices of boundaries in an array where starts of boundaries always start with a particular number; non-boundaries by a particular number

Asked 2022-Mar-01 at 00:01

Problem:

I am looking for the most computationally efficient way to get the indices of boundaries in an array, where starts of boundaries are always marked by a particular number and non-boundaries are indicated by a different particular number.

Differences between this question and other boundary-based numpy questions on SO:

here are some other boundary based numpy questions

Numpy 1D array - find indices of boundaries of subsequences of the same number

Getting the boundary of numpy array shape with a hole

Extracting boundary of a numpy array

The difference between my question and those other Stack Overflow posts is that in them the boundaries are indicated by a jump in value, or by a 'hole' of values.

What seems to be unique to my case is the starts of boundaries always start with a particular number.

Motivation:

This problem is inspired by IOB tagging in natural language processing. In IOB tagging, B [beginning] tags the first token of an entity, I [inside] tags every other token inside an entity, and O [outside] tags all non-entity tokens.

Example:

import numpy as np

a = np.array(
    [
     0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 1, 0, 0, 1, 2, 1, 1, 0, 0, 1, 1, 1
    ]
)

1 is the start of each boundary. If a boundary has a length greater than one, then 2 makes up the rest of the boundary. 0 are non-boundary numbers.

The entities of these boundaries are 1, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1

So the desired solution, the index pairs of the boundary values in a, is

desired = [[3, 6], [10, 10], [13, 14], [15, 15], [16,16], [19,19], [20,20], [21,21]]

Current Solution:

If flattened, the numbers in the desired solution are in ascending order. So the raw indices numbers can be calculated, sorted, and reshaped later.

I can get the start indices using

starts = np.where(a==1)[0]
starts
array([ 3, 10, 13, 15, 16, 19, 20, 21])

So what's left is 6, 10, 14, 15, 16, 19, 20, 21.

I can get all except one of them using three different conditionals, where I compare a shifted array to the original, looking at decreases in value and at the values of the non-shifted array.

first = np.where(a[:-1] - 2 == a[1:])[0]
first
array([6])

second = np.where((a[:-1] - 1 == a[1:]) &
    ((a[1:]==1) | (a[1:]==0)))[0]
second
array([10, 14, 16])

third = np.where(
    (a[:-1] == a[1:]) &
    (a[1:]==1)
    )[0]
third
array([15, 19, 20])

The last number I need is 21, but since I needed to shorten the length of the array by 1 to do the shifted comparisons, I'm not sure how to get that particular value using logic, so I just used a simple if statement for that.

Using the rest of the retrieved values for the indices, I can concatenate all the values and reshape them.

if (a[-1] == 1) | (a[-1] == 2):
    pen = np.concatenate((
        starts, first, second, third, np.array([a.shape[0]-1])
    ))
else:
    pen = np.concatenate((
        starts, first, second, third,
    ))
np.sort(pen).reshape(-1,2)
array([[ 3,  6],
       [10, 10],
       [13, 14],
       [15, 15],
       [16, 16],
       [19, 19],
       [20, 20],
       [21, 21]])

Is this the most computationally efficient solution? I realize the four where statements could be combined with or operators, but I kept them separate so the reader can see each result in this post. I am wondering whether there is a more computationally efficient solution, since I have not mastered all of numpy's functions and am unsure of the computational efficiency of each.

ANSWER

Answered 2022-Mar-01 at 00:01

A standard trick for this type of problem is to pad the input appropriately. In this case, it is helpful to append a 0 to the end of the array:

In [55]: a1 = np.concatenate((a, [0]))

In [56]: a1
Out[56]:
array([0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 1, 0, 0, 1, 2, 1, 1, 0, 0, 1, 1, 1,
       0])

Then your starts calculation still works:

In [57]: starts = np.where(a1 == 1)[0]

In [58]: starts
Out[58]: array([ 3, 10, 13, 15, 16, 19, 20, 21])

The condition for the end of a boundary is that the value is a 1 or a 2 followed by a value that is not 2. You've already figured out that to handle the "followed by" condition, you can use a shifted version of the array. To implement the and and or conditions, use the bitwise binary operators & and |, respectively. In code, it looks like:

In [61]: ends = np.where((a1[:-1] != 0) & (a1[1:] != 2))[0]

In [62]: ends
Out[62]: array([ 6, 10, 14, 15, 16, 19, 20, 21])

Finally, put starts and ends into a single array:

In [63]: np.column_stack((starts, ends))
Out[63]:
array([[ 3,  6],
       [10, 10],
       [13, 14],
       [15, 15],
       [16, 16],
       [19, 19],
       [20, 20],
       [21, 21]])
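Putting the padded-array approach from this answer together as a single reusable function (a sketch, using only the operations shown above):

import numpy as np

def boundary_spans(a):
    # Pad with a trailing 0 so a boundary that ends at the last element still closes
    a1 = np.concatenate((a, [0]))
    starts = np.where(a1 == 1)[0]
    # An end is a 1 or 2 that is not followed by a 2
    ends = np.where((a1[:-1] != 0) & (a1[1:] != 2))[0]
    return np.column_stack((starts, ends))

a = np.array([0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 1, 0, 0, 1, 2, 1, 1, 0, 0, 1, 1, 1])
print(boundary_spans(a))  # the eight [start, end] pairs from the example above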

Source https://stackoverflow.com/questions/71294303

QUESTION

How to replace text between multiple tags based on character length

Asked 2022-Feb-11 at 14:53

I am dealing with dirty text data (not valid HTML). I am doing natural language processing, and short code snippets shouldn't be removed because they can contain valuable information, while long code snippets don't.

That's why I would like to remove text between code tags only if the content that will be removed has a character length > n.

Let's say the number of allowed characters between two code tags is n <= 5. Then everything between those tags that is longer than 5 characters will be removed.

My approach so far deletes all of the code characters:

import re

text = "This is a string <code>1234</code> another string <code>123</code> another string <code>123456789</code> another string."
text = re.sub("<code>.*?</code>", '', text)
print(text)

Output: This is a string  another string  another string  another string.

The desired output:

"This is a string <code>1234</code> another string <code>123</code> another string another string."

Is there a way to count the text length inside each of the appearing <code>...</code> tags before the content is actually removed?

ANSWER

Answered 2022-Feb-11 at 14:53

In Python, BeautifulSoup is often used to manipulate HTML/XML contents. If you use this library, you can use something like

from bs4 import BeautifulSoup

text = "This is a string <code>1234</code> another string <code>123</code> another string <code>123456789</code> another string."
soup = BeautifulSoup(text, "html.parser")
for code in soup.find_all("code"):
    if len(code.encode_contents()) > 5: # Check the inner HTML length
        code.extract()                  # Remove the node found

print(str(soup))
# => This is a string <code>1234</code> another string <code>123</code> another string  another string.

Note that here, the length of the inner HTML part is taken into account, not the inner text.

With regex, you can use a negated character class pattern, [^<], to match any char other than <, and apply a limiting quantifier to it. If all longer than 5 chars should be removed, use {6,} quantifier:

import re

text = "This is a string <code>1234</code> another string <code>123</code> another string <code>123456789</code> another string."
text = re.sub(r'<code>[^<]{6,}</code>', '', text)
print(text)
# => This is a string <code>1234</code> another string <code>123</code> another string  another string.

Source https://stackoverflow.com/questions/71081724

QUESTION

Remove duplicates from a tuple

Asked 2022-Feb-09 at 23:43

I tried to extract keywords from a text. Using the "en_core_sci_lg" model, I got a tuple of phrases/words with some duplicates that I tried to remove. I tried deduplication approaches for lists and tuples, but they failed. Can anyone help? I really appreciate it.

text = """spaCy is an open-source software library for advanced natural language processing,
written in the programming languages Python and Cython. The MIT library is published under the MIT license and its main developers are Matthew Honnibal and Ines Honnibal, the founders of the software company Explosion."""

One set of code I have tried:

import spacy
nlp = spacy.load("en_core_sci_lg")

doc = nlp(text)
my_tuple = list(set(doc.ents))
print('original tuple', doc.ents, len(doc.ents))
print('after set function', my_tuple, len(my_tuple))

the output:

original tuple: (spaCy, open-source software library, programming languages, Python, Cython, MIT, library, published, MIT, license, developers, Matthew Honnibal, Ines, Honnibal, founders, software company Explosion) 16

after set function: [Honnibal, MIT, Ines, software company Explosion, founders, programming languages, library, Matthew Honnibal, license, Cython, Python, developers, MIT, published, open-source software library, spaCy] 16

The desired output is (there should be only one MIT, and the name Ines Honnibal should be kept together):

[Ines Honnibal, MIT, software company Explosion, founders, programming languages, library, Matthew Honnibal, license, Cython, Python, developers, published, open-source software library, spaCy]

ANSWER

Answered 2022-Feb-09 at 22:08

doc.ents is not a list of strings. It is a list of Span objects. When you print one, it prints its contents, but they are indeed individual objects, which is why set doesn't see they are duplicates. The clue to that is there are no quote marks in your print statement. If those were strings, you'd see quotation marks.

You should try using doc.words instead of doc.ents. If that doesn't work for you, for some reason, you can do:

my_tuple = list(set(e.text for e in doc.ents))

Source https://stackoverflow.com/questions/71057313

QUESTION

ValueError: You must include at least one label and at least one sequence

Asked 2021-Dec-14 at 09:15

I'm using this Notebook, where section Apply DocumentClassifier is altered as below.

Jupyter Labs, kernel: conda_mxnet_latest_p37.

The error appears to be a standard ML validation message. However, I pass/create the same parameters and variable names as the original code, so it must be something to do with their values in my code.


My Code:

with open('filt_gri.txt', 'r') as filehandle:
    tags = [current_place.rstrip() for current_place in filehandle.readlines()]

doc_classifier = TransformersDocumentClassifier(model_name_or_path="cross-encoder/nli-distilroberta-base",
                                                task="zero-shot-classification",
                                                labels=tags,
                                                batch_size=16)

# convert to Document using a fieldmap for custom content fields the classification should run on
docs_to_classify = [Document.from_dict(d) for d in docs_sliding_window]

# classify using gpu, batch_size makes sure we do not run out of memory
classified_docs = doc_classifier.predict(docs_to_classify)

# let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
print(classified_docs[0].to_dict())

all_docs = convert_files_to_dicts(dir_path=doc_dir)

preprocessor_sliding_window = PreProcessor(split_overlap=3,
                                           split_length=10,
                                           split_respect_sentence_boundary=False,
                                           split_by='passage')
24

Output:

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-77eb98038283> in <module>
     14 
     15 # classify using gpu, batch_size makes sure we do not run out of memory
---> 16 classified_docs = doc_classifier.predict(docs_to_classify)
     17 
     18 # let's see how it looks: there should be a classification result in the meta entry containing labels and scores.

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/haystack/nodes/document_classifier/transformers.py in predict(self, documents)
    137         batches = self.get_batches(texts, batch_size=self.batch_size)
    138         if self.task == 'zero-shot-classification':
--> 139             batched_predictions = [self.model(batch, candidate_labels=self.labels, truncation=True) for batch in batches]
    140         elif self.task == 'text-classification':
    141             batched_predictions = [self.model(batch, return_all_scores=self.return_all_scores, truncation=True) for batch in batches]

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/haystack/nodes/document_classifier/transformers.py in <listcomp>(.0)
    137         batches = self.get_batches(texts, batch_size=self.batch_size)
    138         if self.task == 'zero-shot-classification':
--> 139             batched_predictions = [self.model(batch, candidate_labels=self.labels, truncation=True) for batch in batches]
    140         elif self.task == 'text-classification':
    141             batched_predictions = [self.model(batch, return_all_scores=self.return_all_scores, truncation=True) for batch in batches]

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/pipelines/zero_shot_classification.py in __call__(self, sequences, candidate_labels, hypothesis_template, multi_label, **kwargs)
    151             sequences = [sequences]
    152 
--> 153         outputs = super().__call__(sequences, candidate_labels, hypothesis_template)
    154         num_sequences = len(sequences)
    155         candidate_labels = self._args_parser._parse_labels(candidate_labels)

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/pipelines/base.py in __call__(self, *args, **kwargs)
    758 
    759     def __call__(self, *args, **kwargs):
--> 760         inputs = self._parse_and_tokenize(*args, **kwargs)
    761         return self._forward(inputs)
    762 

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/pipelines/zero_shot_classification.py in _parse_and_tokenize(self, sequences, candidate_labels, hypothesis_template, padding, add_special_tokens, truncation, **kwargs)
     92         Parse arguments and tokenize only_first so that hypothesis (label) is not truncated
     93         """
---> 94         sequence_pairs = self._args_parser(sequences, candidate_labels, hypothesis_template)
     95         inputs = self.tokenizer(
     96             sequence_pairs,

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/pipelines/zero_shot_classification.py in __call__(self, sequences, labels, hypothesis_template)
     25     def __call__(self, sequences, labels, hypothesis_template):
     26         if len(labels) == 0 or len(sequences) == 0:
---> 27             raise ValueError("You must include at least one label and at least one sequence.")
     28         if hypothesis_template.format(labels[0]) == hypothesis_template:
     29             raise ValueError(

ValueError: You must include at least one label and at least one sequence.

Original Code:

doc_classifier = TransformersDocumentClassifier(model_name_or_path="cross-encoder/nli-distilroberta-base",
    task="zero-shot-classification",
    labels=["music", "natural language processing", "history"],
    batch_size=16
)

# ----------

# convert to Document using a fieldmap for custom content fields the classification should run on
docs_to_classify = [Document.from_dict(d) for d in docs_sliding_window]

# ----------

# classify using gpu, batch_size makes sure we do not run out of memory
classified_docs = doc_classifier.predict(docs_to_classify)

# ----------

# let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
print(classified_docs[0].to_dict())

Please let me know if there is anything else I should add to the post or clarify.

ANSWER

Answered 2021-Dec-08 at 21:05

Reading the official docs and noting that the error is raised when calling .predict(docs_to_classify), I recommend you first run a basic test with a hard-coded parameter such as labels = ["negative", "positive"], then check whether the problem is caused by the string values read from the external file. Optionally, you should also check the part of the docs that describes the use of pipelines:

pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=doc_classifier, name='DocClassifier', inputs=['Retriever'])

Source https://stackoverflow.com/questions/70278323

QUESTION

RuntimeError: CUDA out of memory | Elastic Search

Asked 2021-Dec-09 at 11:53

I'm fairly new to Machine Learning. I've successfully solved errors to do with parameters and model setup.

I'm using this Notebook, where section Apply DocumentClassifier is altered as below.

Jupyter Labs, kernel: conda_mxnet_latest_p37.


Error seems to be more about my laptop's hardware, rather than my code being broken.

Update: I changed batch_size to 4; it ran for ages, only to crash.

What should be my standard approach to solving this error?


My Code:

with open('filt_gri.txt', 'r') as filehandle:
    tags = [current_place.rstrip() for current_place in filehandle.readlines()]

doc_classifier = TransformersDocumentClassifier(model_name_or_path="cross-encoder/nli-distilroberta-base",
                                                task="zero-shot-classification",
                                                labels=tags,
                                                batch_size=4)

# convert to Document using a fieldmap for custom content fields the classification should run on
docs_to_classify = [Document.from_dict(d) for d in docs_sliding_window]

# classify using gpu, batch_size makes sure we do not run out of memory
classified_docs = doc_classifier.predict(docs_to_classify)

# let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
print(classified_docs[0].to_dict())

all_docs = convert_files_to_dicts(dir_path=doc_dir)

preprocessor_sliding_window = PreProcessor(split_overlap=3,
                                           split_length=10,
                                           split_respect_sentence_boundary=False,
                                           split_by='passage')

Error:

1with open('filt_gri.txt', 'r') as filehandle:
2    tags = [current_place.rstrip() for current_place in filehandle.readlines()]
3
4doc_classifier = TransformersDocumentClassifier(model_name_or_path=&quot;cross-encoder/nli-distilroberta-base&quot;,
5                                                task=&quot;zero-shot-classification&quot;,
6                                                labels=tags,
7                                                batch_size=4)
8
9# convert to Document using a fieldmap for custom content fields the classification should run on
10docs_to_classify = [Document.from_dict(d) for d in docs_sliding_window]
11
12# classify using gpu, batch_size makes sure we do not run out of memory
13classified_docs = doc_classifier.predict(docs_to_classify)
14
15# let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
16print(classified_docs[0].to_dict())
17
18all_docs = convert_files_to_dicts(dir_path=doc_dir)
19
20preprocessor_sliding_window = PreProcessor(split_overlap=3,
21                                           split_length=10,
22                                           split_respect_sentence_boundary=False,
23                                           split_by='passage')
24INFO - haystack.modeling.utils -  Using devices: CUDA
25INFO - haystack.modeling.utils -  Using devices: CUDA
26INFO - haystack.modeling.utils -  Number of GPUs: 1
27INFO - haystack.modeling.utils -  Number of GPUs: 1
28---------------------------------------------------------------------------
29RuntimeError                              Traceback (most recent call last)
30&lt;ipython-input-25-27dfca549a7d&gt; in &lt;module&gt;
31     14 
32     15 # classify using gpu, batch_size makes sure we do not run out of memory
33---&gt; 16 classified_docs = doc_classifier.predict(docs_to_classify)
34     17 
35     18 # let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
36
37~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/haystack/nodes/document_classifier/transformers.py in predict(self, documents)
38    137         batches = self.get_batches(texts, batch_size=self.batch_size)
39    138         if self.task == 'zero-shot-classification':
40--&gt; 139             batched_predictions = [self.model(batch, candidate_labels=self.labels, truncation=True) for batch in batches]
41    140         elif self.task == 'text-classification':
42    141             batched_predictions = [self.model(batch, return_all_scores=self.return_all_scores, truncation=True) for batch in batches]
43
44~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/haystack/nodes/document_classifier/transformers.py in &lt;listcomp&gt;(.0)
45    137         batches = self.get_batches(texts, batch_size=self.batch_size)
46    138         if self.task == 'zero-shot-classification':
47--&gt; 139             batched_predictions = [self.model(batch, candidate_labels=self.labels, truncation=True) for batch in batches]
48    140         elif self.task == 'text-classification':
49    141             batched_predictions = [self.model(batch, return_all_scores=self.return_all_scores, truncation=True) for batch in batches]
50
51~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/pipelines/zero_shot_classification.py in __call__(self, sequences, candidate_labels, hypothesis_template, multi_label, **kwargs)
52    151             sequences = [sequences]
53    152 
54--&gt; 153         outputs = super().__call__(sequences, candidate_labels, hypothesis_template)
55    154         num_sequences = len(sequences)
56    155         candidate_labels = self._args_parser._parse_labels(candidate_labels)
57
58~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/pipelines/base.py in __call__(self, *args, **kwargs)
59    759     def __call__(self, *args, **kwargs):
60    760         inputs = self._parse_and_tokenize(*args, **kwargs)
61--&gt; 761         return self._forward(inputs)
62    762 
63    763     def _forward(self, inputs, return_tensors=False):
64
65~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/pipelines/base.py in _forward(self, inputs, return_tensors)
66    780                 with torch.no_grad():
67    781                     inputs = self.ensure_tensor_on_device(**inputs)
68--&gt; 782                     predictions = self.model(**inputs)[0].cpu()
69    783 
70    784         if return_tensors:
71
72~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
73   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
74   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
75-&gt; 1102             return forward_call(*input, **kwargs)
76   1103         # Do not call functions when jit is used
77   1104         full_backward_hooks, non_full_backward_hooks = [], []
78
79~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
80   1162             output_attentions=output_attentions,
81   1163             output_hidden_states=output_hidden_states,
82-&gt; 1164             return_dict=return_dict,
83   1165         )
84   1166         sequence_output = outputs[0]
85
86~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
87   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
88   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
89-&gt; 1102             return forward_call(*input, **kwargs)
90   1103         # Do not call functions when jit is used
91   1104         full_backward_hooks, non_full_backward_hooks = [], []
92
93~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
94    823             output_attentions=output_attentions,
95    824             output_hidden_states=output_hidden_states,
96--&gt; 825             return_dict=return_dict,
97    826         )
98    827         sequence_output = encoder_outputs[0]
99
100~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
101   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
102   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
103-&gt; 1102             return forward_call(*input, **kwargs)
104   1103         # Do not call functions when jit is used
105   1104         full_backward_hooks, non_full_backward_hooks = [], []
106
107~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
108    513                     encoder_attention_mask,
109    514                     past_key_value,
110--&gt; 515                     output_attentions,
111    516                 )
112    517 
113
114~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
115   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
116   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
117-&gt; 1102             return forward_call(*input, **kwargs)
118   1103         # Do not call functions when jit is used
119   1104         full_backward_hooks, non_full_backward_hooks = [], []
120
121~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
122    398             head_mask,
123    399             output_attentions=output_attentions,
124--&gt; 400             past_key_value=self_attn_past_key_value,
125    401         )
126    402         attention_output = self_attention_outputs[0]
127
128~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
129   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
130   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
131-&gt; 1102             return forward_call(*input, **kwargs)
132   1103         # Do not call functions when jit is used
133   1104         full_backward_hooks, non_full_backward_hooks = [], []
134
135~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
136    328             encoder_attention_mask,
137    329             past_key_value,
138--&gt; 330             output_attentions,
139    331         )
140    332         attention_output = self.output(self_outputs[0], hidden_states)
141
142~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
143   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
144   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
145-&gt; 1102             return forward_call(*input, **kwargs)
146   1103         # Do not call functions when jit is used
147   1104         full_backward_hooks, non_full_backward_hooks = [], []
148
149~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
150    241                 attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key
151    242 
152--&gt; 243         attention_scores = attention_scores / math.sqrt(self.attention_head_size)
153    244         if attention_mask is not None:
154    245             # Apply the attention mask is (precomputed for all layers in RobertaModel forward() function)
155
156RuntimeError: CUDA out of memory. Tried to allocate 3.60 GiB (GPU 0; 14.76 GiB total capacity; 7.33 GiB already allocated; 1.37 GiB free; 12.29 GiB reserved in total by PyTorch) If reserved memory is &gt;&gt; allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Original Code:

1doc_classifier = TransformersDocumentClassifier(model_name_or_path=&quot;cross-encoder/nli-distilroberta-base&quot;,
2    task=&quot;zero-shot-classification&quot;,
3    labels=[&quot;music&quot;, &quot;natural language processing&quot;, &quot;history&quot;],
4    batch_size=16
5)
6
7# ----------
8
9# convert to Document using a fieldmap for custom content fields the classification should run on
10docs_to_classify = [Document.from_dict(d) for d in docs_sliding_window]
11
12# ----------
13
14# classify using gpu, batch_size makes sure we do not run out of memory
15classified_docs = doc_classifier.predict(docs_to_classify)
16
17# ----------
18
19# let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
20print(classified_docs[0].to_dict())
21

Please let me know if there is anything else I should add to the post or clarify.

ANSWER

Answered 2021-Dec-09 at 11:53

Reducing the batch_size helped me:

1with open('filt_gri.txt', 'r') as filehandle:
2    tags = [current_place.rstrip() for current_place in filehandle.readlines()]
3
4doc_classifier = TransformersDocumentClassifier(model_name_or_path=&quot;cross-encoder/nli-distilroberta-base&quot;,
5                                                task=&quot;zero-shot-classification&quot;,
6                                                labels=tags,
7                                                batch_size=4)
8
9# convert to Document using a fieldmap for custom content fields the classification should run on
10docs_to_classify = [Document.from_dict(d) for d in docs_sliding_window]
11
12# classify using gpu, batch_size makes sure we do not run out of memory
13classified_docs = doc_classifier.predict(docs_to_classify)
14
15# let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
16print(classified_docs[0].to_dict())
17
18all_docs = convert_files_to_dicts(dir_path=doc_dir)
19
20preprocessor_sliding_window = PreProcessor(split_overlap=3,
21                                           split_length=10,
22                                           split_respect_sentence_boundary=False,
23                                           split_by='passage')
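Beyond lowering batch_size, a few other mitigations are commonly tried for this error. The sketch below is illustrative and not part of the original answer: the allocator hint is taken from the error message itself, the value 128 is an arbitrary example, and use_gpu is assumed to be the Haystack flag for falling back to CPU inference.

import os

# Illustrative out-of-memory mitigations (assumptions, not from the original answer).
# The allocator hint comes from the error message and must be set before
# torch/CUDA is first initialised in the process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch
from haystack.nodes import TransformersDocumentClassifier

# Release cached blocks left over from a previous failed attempt in the same kernel.
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# Same label file as in the question.
with open('filt_gri.txt', 'r') as filehandle:
    tags = [line.rstrip() for line in filehandle.readlines()]

doc_classifier = TransformersDocumentClassifier(
    model_name_or_path="cross-encoder/nli-distilroberta-base",
    task="zero-shot-classification",
    labels=tags,
    batch_size=2,      # keep shrinking the batch until the model fits in GPU memory
    use_gpu=True,      # assumed flag; set False to fall back to slower CPU inference
)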
24INFO - haystack.modeling.utils -  Using devices: CUDA
25INFO - haystack.modeling.utils -  Using devices: CUDA
26INFO - haystack.modeling.utils -  Number of GPUs: 1
27INFO - haystack.modeling.utils -  Number of GPUs: 1
28---------------------------------------------------------------------------
29RuntimeError                              Traceback (most recent call last)
30&lt;ipython-input-25-27dfca549a7d&gt; in &lt;module&gt;
31     14 
32     15 # classify using gpu, batch_size makes sure we do not run out of memory
33---&gt; 16 classified_docs = doc_classifier.predict(docs_to_classify)
34     17 
35     18 # let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
36
37~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/haystack/nodes/document_classifier/transformers.py in predict(self, documents)
38    137         batches = self.get_batches(texts, batch_size=self.batch_size)
39    138         if self.task == 'zero-shot-classification':
40--&gt; 139             batched_predictions = [self.model(batch, candidate_labels=self.labels, truncation=True) for batch in batches]
41    140         elif self.task == 'text-classification':
42    141             batched_predictions = [self.model(batch, return_all_scores=self.return_all_scores, truncation=True) for batch in batches]
43
44~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/haystack/nodes/document_classifier/transformers.py in &lt;listcomp&gt;(.0)
45    137         batches = self.get_batches(texts, batch_size=self.batch_size)
46    138         if self.task == 'zero-shot-classification':
47--&gt; 139             batched_predictions = [self.model(batch, candidate_labels=self.labels, truncation=True) for batch in batches]
48    140         elif self.task == 'text-classification':
49    141             batched_predictions = [self.model(batch, return_all_scores=self.return_all_scores, truncation=True) for batch in batches]
50
51~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/pipelines/zero_shot_classification.py in __call__(self, sequences, candidate_labels, hypothesis_template, multi_label, **kwargs)
52    151             sequences = [sequences]
53    152 
54--&gt; 153         outputs = super().__call__(sequences, candidate_labels, hypothesis_template)
55    154         num_sequences = len(sequences)
56    155         candidate_labels = self._args_parser._parse_labels(candidate_labels)
57
58~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/pipelines/base.py in __call__(self, *args, **kwargs)
59    759     def __call__(self, *args, **kwargs):
60    760         inputs = self._parse_and_tokenize(*args, **kwargs)
61--&gt; 761         return self._forward(inputs)
62    762 
63    763     def _forward(self, inputs, return_tensors=False):
64
65~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/pipelines/base.py in _forward(self, inputs, return_tensors)
66    780                 with torch.no_grad():
67    781                     inputs = self.ensure_tensor_on_device(**inputs)
68--&gt; 782                     predictions = self.model(**inputs)[0].cpu()
69    783 
70    784         if return_tensors:
71
72~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
73   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
74   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
75-&gt; 1102             return forward_call(*input, **kwargs)
76   1103         # Do not call functions when jit is used
77   1104         full_backward_hooks, non_full_backward_hooks = [], []
78
79~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
80   1162             output_attentions=output_attentions,
81   1163             output_hidden_states=output_hidden_states,
82-&gt; 1164             return_dict=return_dict,
83   1165         )
84   1166         sequence_output = outputs[0]
85
86~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
87   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
88   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
89-&gt; 1102             return forward_call(*input, **kwargs)
90   1103         # Do not call functions when jit is used
91   1104         full_backward_hooks, non_full_backward_hooks = [], []
92
93~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
94    823             output_attentions=output_attentions,
95    824             output_hidden_states=output_hidden_states,
96--&gt; 825             return_dict=return_dict,
97    826         )
98    827         sequence_output = encoder_outputs[0]
99
100~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
101   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
102   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
103-&gt; 1102             return forward_call(*input, **kwargs)
104   1103         # Do not call functions when jit is used
105   1104         full_backward_hooks, non_full_backward_hooks = [], []
106
107~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
108    513                     encoder_attention_mask,
109    514                     past_key_value,
110--&gt; 515                     output_attentions,
111    516                 )
112    517 
113
114~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
115   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
116   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
117-&gt; 1102             return forward_call(*input, **kwargs)
118   1103         # Do not call functions when jit is used
119   1104         full_backward_hooks, non_full_backward_hooks = [], []
120
121~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
122    398             head_mask,
123    399             output_attentions=output_attentions,
124--&gt; 400             past_key_value=self_attn_past_key_value,
125    401         )
126    402         attention_output = self_attention_outputs[0]
127
128~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
129   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
130   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
131-&gt; 1102             return forward_call(*input, **kwargs)
132   1103         # Do not call functions when jit is used
133   1104         full_backward_hooks, non_full_backward_hooks = [], []
134
135~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
136    328             encoder_attention_mask,
137    329             past_key_value,
138--&gt; 330             output_attentions,
139    331         )
140    332         attention_output = self.output(self_outputs[0], hidden_states)
141
142~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
143   1100         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
144   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
145-&gt; 1102             return forward_call(*input, **kwargs)
146   1103         # Do not call functions when jit is used
147   1104         full_backward_hooks, non_full_backward_hooks = [], []
148
149~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
150    241                 attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key
151    242 
152--&gt; 243         attention_scores = attention_scores / math.sqrt(self.attention_head_size)
153    244         if attention_mask is not None:
154    245             # Apply the attention mask is (precomputed for all layers in RobertaModel forward() function)
155
156RuntimeError: CUDA out of memory. Tried to allocate 3.60 GiB (GPU 0; 14.76 GiB total capacity; 7.33 GiB already allocated; 1.37 GiB free; 12.29 GiB reserved in total by PyTorch) If reserved memory is &gt;&gt; allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
doc_classifier = TransformersDocumentClassifier(model_name_or_path="cross-encoder/nli-distilroberta-base",
    task="zero-shot-classification",
    labels=["music", "natural language processing", "history"],
    batch_size=16
)

# ----------

# convert to Document using a fieldmap for custom content fields the classification should run on
docs_to_classify = [Document.from_dict(d) for d in docs_sliding_window]

# ----------

# classify using gpu, batch_size makes sure we do not run out of memory
classified_docs = doc_classifier.predict(docs_to_classify)

# ----------

# let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
print(classified_docs[0].to_dict())

# suggested fix: lower the batch size so each forward pass fits into GPU memory
batch_size=2
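
A minimal sketch of applying that suggestion, reusing the classifier setup from the question; the allocator hint comes straight from the error message, and the value 128 is only an example (the environment variable must be set before the first CUDA allocation):

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # example value; reduces fragmentation per the error hint

from haystack.nodes import TransformersDocumentClassifier

doc_classifier = TransformersDocumentClassifier(
    model_name_or_path="cross-encoder/nli-distilroberta-base",
    task="zero-shot-classification",
    labels=["music", "natural language processing", "history"],
    batch_size=2,  # smaller batches -> smaller peak activation memory on the GPU
)
classified_docs = doc_classifier.predict(docs_to_classify)  # docs_to_classify as built above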

Source https://stackoverflow.com/questions/70288528

QUESTION

News article extraction using requests, bs4 and newspaper packages: why doesn't links=soup.select(".r a") find anything? This code was working earlier

Asked 2021-Nov-16 at 22:43

Objective: I am trying to download news articles based on keywords in order to perform sentiment analysis.

This code was working a few months ago but now it returns a null value. I tried fixing the issue, but links=soup.select(".r a") still returns nothing.

import pandas as pd
import requests
from bs4 import BeautifulSoup
import string
import nltk
from urllib.request import urlopen
import sys
import webbrowser
import newspaper
import time
from newspaper import Article

Company_name1 =[]
Article_number1=[]
Article_Title1=[]
Article_Authors1=[]
Article_pub_date1=[]
Article_Text1=[]
Article_Summary1=[]
Article_Keywords1=[]
Final_dataframe=[]

class Newspapr_pd:
    def __init__(self,term):
        self.term=term
        self.subjectivity=0
        self.sentiment=0
        self.url='https://www.google.com/search?q={0}&safe=active&tbs=qdr:w,sdb:1&tbm=nws&source=lnt&dpr=1'.format(self.term)

    def NewsArticlerun_pd(self):
        response=requests.get(self.url)
        response.raise_for_status()
        #print(response.text)
        soup=bs4.BeautifulSoup(response.text,'html.parser')
        links=soup.select(".r a")

        numOpen = min(5, len(links))
        Article_number=0
        for i in range(numOpen):
            response_links = webbrower.open("https://www.google.com" + links[i].get("href"))

        #For different language newspaper refer above table
            article = Article(response_links, language="en") # en for English
            Article_number+=1

            print('*************************************************************************************')

            Article_number1.append(Article_number)
            Company_name1.append(self.term)

        #To download the article
            try:
                article.download()
                #To parse the article
                article.parse()
                #To perform natural language processing ie..nlp
                article.nlp()

        #To extract title
                Article_Title1.append(article.title)

        #To extract text
                Article_Text1.append(article.text)

        #To extract Author name
                Article_Authors1.append(article.authors)

        #To extract article published date
                Article_pub_date1.append(article.publish_date)

        #To extract summary
                Article_Summary1.append(article.summary)

        #To extract keywords
                Article_Keywords1.append(article.keywords)

            except:
                print('Error in loading page')
                continue

        for art_num,com_name,title,text,auth,pub_dt,summaries,keywds in zip(Article_number1,Company_name1,Article_Title1,Article_Text1,Article_Authors1,Article_pub_date1,Article_Summary1,Article_Keywords1):
            Final_dataframe.append({'Article_link_num':art_num, 'Company_name':com_name,'Article_Title':title,'Article_Text':text,'Article_Author':auth,
                                   'Article_Published_date':pub_dt,'Article_Summary':summaries,'Article_Keywords':keywds})

list_of_companies=['Amazon','Jetairways','nirav modi']

for i in list_of_companies:
    comp = str('"'+ i + '"')
    a=Newspapr_pd(comp)
    a.NewsArticlerun_pd()

Final_new_dataframe=pd.DataFrame(Final_dataframe)
Final_new_dataframe.tail()

ANSWER

Answered 2021-Nov-16 at 22:43

This is a very complex issue, because Google News continually changes its class names. Additionally, Google adds various prefixes to article URLs and throws in some hidden ad or social media tags.

The answer below only addresses scraping articles from Google News. More testing is needed to determine how it works with a large number of keywords and with Google News changing page structure.

The Newspaper3k extraction is even more complex, because each article can have a different structure. I would recommend looking at my Newspaper3k Usage Overview document for details on how to design that part of your code.

P.S. I'm currently writing a new news scraper, because development of Newspaper3k is dead. I'm unsure of the release date of my code.

import requests
import re as regex
from bs4 import BeautifulSoup


def get_google_news_article(search_string):
    articles = []
    url = f'https://www.google.com/search?q={search_string}&safe=active&tbs=qdr:w,sdb:1&tbm=nws&source=lnt&dpr=1'
    response = requests.get(url)
    raw_html = BeautifulSoup(response.text, "lxml")
    main_tag = raw_html.find('div', {'id': 'main'})
    for div_tag in main_tag.find_all('div', {'class': regex.compile('xpd')}):
        for a_tag in div_tag.find_all('a', href=True):
            if not a_tag.get('href').startswith('/search?'):
                none_articles = bool(regex.search('amazon.com|facebook.com|twitter.com|youtube.com|wikipedia.org', a_tag['href']))
                if none_articles is False:
                    if a_tag.get('href').startswith('/url?q='):
                        find_article = regex.search('(.*)(&sa=)', a_tag.get('href'))
                        article = find_article.group(1).replace('/url?q=', '')
                        if article.startswith('https://'):
                            articles.append(article)

    return articles


list_of_companies = ['amazon', 'jet airways', 'nirav modi']
for company_name in list_of_companies:
    print(company_name)
    search_results = get_google_news_article(company_name)
    for item in sorted(set(search_results)):
        print(item)
    print('\n')

This is the output from the code above:

amazon
https://9to5mac.com/2021/11/15/amazon-releases-native-prime-video-app-for-macos-with-purchase-support-and-more/
https://wtvbam.com/2021/11/15/india-police-to-question-amazon-executives-in-probe-over-marijuana-smuggling/
https://www.cnet.com/home/smart-home/all-the-new-amazon-features-for-your-smart-home-alexa-disney-echo/
https://www.cnet.com/tech/amazon-unveils-black-friday-deals-starting-on-nov-25/
https://www.crossroadstoday.com/i/amazons-best-black-friday-deals-for-2021-2/
https://www.reuters.com/technology/ibm-amazon-partner-extend-reach-data-tools-oil-companies-2021-11-15/
https://www.theverge.com/2021/11/15/22783275/amazon-basics-smart-switches-price-release-date-specs
https://www.tomsguide.com/news/amazon-echo-motion-detection
https://www.usatoday.com/story/money/shopping/2021/11/15/amazon-black-friday-2021-deals-online/8623710002/
https://www.winknews.com/2021/11/15/new-amazon-sortation-center-began-operations-monday-could-bring-faster-deliveries/

jet airways
https://economictimes.indiatimes.com/markets/expert-view/first-time-in-two-decades-new-airlines-are-starting-instead-of-closing-down-jyotiraditya-scindia/articleshow/87660724.cms
https://menafn.com/1103125331/Jet-Airways-to-resume-operations-in-Q1-2022
https://simpleflying.com/jet-airways-100-aircraft-5-years/
https://simpleflying.com/jet-airways-q3-loss/
https://www.business-standard.com/article/companies/defunct-carrier-jet-airways-posts-rs-306-cr-loss-in-september-quarter-121110901693_1.html
https://www.business-standard.com/article/markets/stocks-to-watch-ril-aurobindo-bhel-m-m-jet-airways-idfc-powergrid-121110900189_1.html
https://www.financialexpress.com/market/nykaa-hdfc-zee-media-jet-airways-power-grid-berger-paints-petronet-lng-stocks-in-focus/2366063/
https://www.moneycontrol.com/news/business/earnings/jet-airways-standalone-september-2021-net-sales-at-rs-41-02-crore-up-313-51-y-o-y-7702891.html
https://www.spokesman.com/stories/2021/nov/11/boeing-set-to-dent-airbus-india-dominance-with-737/
https://www.timesnownews.com/business-economy/industry/article/times-now-summit-2021-jet-airways-will-make-a-comeback-into-indian-skies-akasa-to-take-off-next-year-says-jyotiraditya-scindia/831090

nirav modi
https://m.republicworld.com/india-news/general-news/piyush-goyal-says-few-rotten-eggs-destroyed-credibility-of-countrys-ca-sector.html
https://www.bulletnews.net/akkad-bakkad-rafu-chakkar-review-the-story-of-robbing-people-by-making-fake-banks/
https://www.daijiworld.com/news/newsDisplay%3FnewsID%3D893048
https://www.devdiscourse.com/article/law-order/1805317-hc-seeks-centres-stand-on-bankers-challenge-to-dismissal-from-service
https://www.geo.tv/latest/381560-arif-naqvis-extradition-case-to-be-heard-after-nirav-modi-case-ruling
https://www.hindustantimes.com/india-news/cbiand-ed-appointments-that-triggered-controversies-101636954580012.html
https://www.law360.com/articles/1439470/suicide-test-ruling-delays-abraaj-founder-s-extradition-case
https://www.moneycontrol.com/news/trends/current-affairs-trends/nirav-modi-extradition-case-outcome-of-appeal-to-also-affect-pakistani-origin-global-financier-facing-16-charges-of-fraud-and-money-laundering-7717231.html
https://www.thehansindia.com/hans/opinion/news-analysis/uniform-law-needed-for-free-exit-of-rich-businessmen-714566
https://www.thenews.com.pk/print/908374-uk-judge-delays-arif-naqvi-s-extradition-to-us
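
As a follow-up (not part of the answer above), a short sketch of feeding the URLs returned by get_google_news_article() into Newspaper3k, reusing the Article workflow from the question's code:

from newspaper import Article

for url in get_google_news_article('"Amazon"'):
    article = Article(url, language="en")  # en for English
    try:
        article.download()
        article.parse()
        article.nlp()  # required before .summary and .keywords are populated
        print(article.title, article.publish_date)
    except Exception as exc:  # some sites refuse automated downloads
        print(f"Error in loading page: {exc}")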

Source https://stackoverflow.com/questions/69938681

QUESTION

Changing Stanza model directory for pyinstaller executable

Asked 2021-Nov-01 at 06:03

I have an application that analyses text looking for keywords using natural language processing.
I created an executable and it works fine on my computer.
I've sent it to a friend, but on his computer he gets an error:

Traceback (most recent call last):
  File "Main.py", line 15, in <module>
  File "Menu.py", line 349, in main_menu
  File "Menu.py", line 262, in analyse_text_menu
  File "Menu.py", line 178, in analyse_text_function
  File "AnalyseText\ProcessText.py", line 232, in process_text
  File "AnalyseText\ProcessText.py", line 166, in generate_keyword_complete_list
  File "AnalyseText\ProcessText.py", line 135, in lemmatize_text
  File "stanza\pipeline\core.py", line 88, in __init__
stanza.pipeline.core.ResourcesFileNotFoundError: Resources file not found at: C:\Users\jpovoas\stanza_resources\resources.json  Try to download the model again.
[26408] Failed to execute script 'Main' due to unhandled exception!

It looks for resources.json inside a folder on his computer, even though I've added stanza as a hidden import with pyinstaller.

I'm using a model in another language, as opposed to the default English one. The model is located in a folder inside the User folder.

The thing is, I don't want the end user to have to download the model separately.

I've managed to include the model folder with --add-data "C:\Users\Laila\stanza_resources\pt;Stanza" when creating the executable.

It still looks for the model's JSON file inside the stanza_resources folder that should be inside the User folder of whoever is using the program.

How do I tell stanza to look for the model inside the folder generated alongside the executable instead?

Can I just add stanza.download("language") in my script? If so, how do I change stanza's model download folder? I want it to be downloaded into a folder inside the same directory as the executable. How do I do that?

ANSWER

Answered 2021-Sep-14 at 04:27

You can try downloading that JSON file. Here is a snippet for the same.

import urllib.request

json_url = "http://LINK_TO_YOUR_JSON/resources.json"

urllib.request.urlretrieve(json_url, "./resources.json")
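
That only restores the missing resources.json; it does not bundle the model itself. If the goal is to ship the model next to the executable and point Stanza at it, a rough sketch along these lines may help (model_dir and dir are Stanza's arguments for choosing the resources folder; the "pt" language code is taken from the question's path, and the folder name is an assumption):

import os
import sys
import stanza

# Hypothetical layout: a "stanza_resources" folder shipped next to the script/executable
# (for example bundled with PyInstaller's --add-data).
if getattr(sys, "frozen", False):          # running as a PyInstaller executable
    base_dir = os.path.dirname(sys.executable)
else:
    base_dir = os.path.dirname(os.path.abspath(__file__))
model_dir = os.path.join(base_dir, "stanza_resources")

# Download into that folder once (skip this step when the folder is already bundled).
stanza.download("pt", model_dir=model_dir)

# Load models from the bundled folder instead of ~/stanza_resources.
nlp = stanza.Pipeline("pt", dir=model_dir)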

Source https://stackoverflow.com/questions/69164174

QUESTION

Type-Token Ratio in Google Sheets: How to manipulate long strings of text (millions of characters)

Asked 2021-Oct-03 at 11:39

Here’s the challenge. In a Google Sheets spreadsheet, I have a column containing lists of words separated by commas, one list per row, up to a thousand rows. Each list shows the words taken from a text, in alphanumerical order, from a few hundred to a few thousand words. I need to count both the total number of words across all the rows taken together and the number of unique word forms. In the vocabulary of natural language processing, I want to know the number of tokens and the number of types in my corpus, in order to calculate the type-token ratio, or lexical density.

In particular, finding the number of unique word forms in the whole column has proven to be a challenge. In an ARRAYFORMULA, with the corresponding functions, I’ve JOINed the strings, SPLIT the words, TRANSPOSEd them, removed duplicates with the UNIQUE function, then counted the remaining word forms. This worked on a sample corpus of a little over ten lists of words, but failed when I reached fifteen or so lists taken together, a far cry from the thousand lists I need to join in my formula to obtain the results I am looking for.

From what I can gather, the problem is that the resulting string I intend to manipulate exceeds 50,000 characters. Here and there, for specific cases, I’ve found similar questions and proposed workarounds, mostly through custom functions, but I could not replicate the results. Needless to say, writing custom functions on my own is beyond my reach. Someone suggested using QUERY headers, but I could not figure out whether this would be of any help in my case.

The formulas I came up with are the following:

To obtain the total number of words (tokens) through all the lists: =COUNTA(ARRAYFORMULA(SPLIT(JOIN(",";1;B2:B);",")))

To obtain the number of unique word forms (types) through all the lists: =COUNTA(ARRAYFORMULA(UNIQUE(TRANSPOSE(SPLIT(JOIN(",";1;B2:B);",")))))

A sample in a spreadsheet can be found here.

EDIT 1:

I’ve included the column of texts stripped of punctuation, from which the lists of words are generated, and the formula used to generate them.

EDIT 2:

Changed the title to better reflect the general intent.

ANSWER

Answered 2021-Oct-02 at 13:57

For total items, try:

=arrayformula(query(flatten(iferror(split(B2:B;",";1);));"select count(Col1) where Col1 !='' label count(Col1) '' ";0))

For total unique items:

=arrayformula(query(unique(flatten(iferror(split(B2:B;",";1);)));"select count(Col1) where Col1 !='' label count(Col1) '' ";0))

You might get problems if you have too many rows in the sheet. If so, set the range limit to something like B2:B1000

Add this to cell C1 to get a list of 'Comma separated items':

=arrayformula({"Comma separated items";if(B2:B<>"";len(regexreplace(B2:B;"[^\,]";))+1;)})

Explanation:

The arrayformula() allows the calculation to cascade down the sheet, from one cell.

So within the arrayformula(), the starting point is the split(B2:B;",") to create columns for each of the comma separated items.

The iferror(split(B2:B;",");"") leaves a blank where cells don't have a comma (like those from row 32). Instead of ;"") shown above I usually just use ;), removing "" so nothing is the result of the iferror.

Then flatten() takes all of the columns and flattens them into a single column.

query() is needed to count the resulting column (count(Col1)) over the non-empty cells (where Col1 !=''), and the label count(Col1) '' removes the 'count' label which would usually be displayed.

For the list of unique values, unique() is placed before the query(), after the flatten().
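
If the 50,000-character ceiling keeps getting in the way, the same counts are also straightforward to compute outside Sheets. A small Python sketch, assuming the comma-separated lists have been exported to a plain-text file with one list per line (the file name is just a placeholder):

def type_token_ratio(path):
    """Count tokens (all words) and types (unique word forms) across comma-separated lists."""
    tokens = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            tokens.extend(word.strip() for word in line.split(",") if word.strip())
    types = set(tokens)
    ratio = len(types) / len(tokens) if tokens else 0.0
    return len(tokens), len(types), ratio

n_tokens, n_types, ttr = type_token_ratio("word_lists.txt")  # placeholder file name
print(n_tokens, n_types, round(ttr, 4))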

Source https://stackoverflow.com/questions/69416862

QUESTION

A bug in tf.keras.layers.TextVectorization when built from saved configs and weights

Asked 2021-Sep-28 at 13:57

I have tried writing a Python program to save tf.keras.layers.TextVectorization to disk and load it, following the answer to How to save TextVectorization to disk in tensorflow?. A TextVectorization layer built from the saved config outputs a vector of the wrong length when output_sequence_length is not None and output_mode='int'. For example, with output_sequence_length=10 and output_mode='int', TextVectorization should output a vector of length 10 for a given text; see vectorizer and new_v2 in the code below. However, if output_mode='int' is set from the saved config, it does not output a vector of length 10 (it is actually 9, the real length of the sentence; it seems output_sequence_length is not applied). See the object new_v1 in the code below. The interesting thing is, I have compared from_disk['config']['output_mode'] and 'int', and they are equal.

import tensorflow as tf
from tensorflow.keras.models import load_model
import pickle
# In[]
max_len = 10  # Sequence length to pad the outputs to.
text_dataset = tf.data.Dataset.from_tensor_slices([
                                                   "I like natural language processing",
                                                   "You like computer vision",
                                                   "I like computer games and computer science"])
# Fit a TextVectorization layer
VOCAB_SIZE = 10  # Maximum vocab size.
vectorizer = tf.keras.layers.TextVectorization(
        max_tokens=None,
        standardize="lower_and_strip_punctuation",
        split="whitespace",
        output_mode='int',
        output_sequence_length=max_len
        )
vectorizer.adapt(text_dataset.batch(64))
# In[]
#print(vectorizer.get_vocabulary())
#print(vectorizer.get_config())
#print(vectorizer.get_weights())
# In[]

# Pickle the config and weights
pickle.dump({'config': vectorizer.get_config(),
             'weights': vectorizer.get_weights()}
            , open("./models/tv_layer.pkl", "wb"))

# Later you can unpickle and use
# `config` to create object and
# `weights` to load the trained weights.

from_disk = pickle.load(open("./models/tv_layer.pkl", "rb"))

new_v1 = tf.keras.layers.TextVectorization(
        max_tokens=None,
        standardize="lower_and_strip_punctuation",
        split="whitespace",
        output_mode=from_disk['config']['output_mode'],
        output_sequence_length=from_disk['config']['output_sequence_length'],
        )
# You have to call `adapt` with some dummy data (BUG in Keras)
new_v1.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
new_v1.set_weights(from_disk['weights'])
new_v2 = tf.keras.layers.TextVectorization(
        max_tokens=None,
        standardize="lower_and_strip_punctuation",
        split="whitespace",
        output_mode='int',
        output_sequence_length=from_disk['config']['output_sequence_length'],
        )

# You have to call `adapt` with some dummy data (BUG in Keras)
new_v2.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
new_v2.set_weights(from_disk['weights'])
print ("*"*10)
# In[]
test_sentence="Jack likes computer scinece, computer games, and foreign language"

print(vectorizer(test_sentence))
print (new_v1(test_sentence))
print (new_v2(test_sentence))
print(from_disk['config']['output_mode']=='int')
Here are the print() outputs:

**********
tf.Tensor([ 1  1  3  1  3 11 12  1 10  0], shape=(10,), dtype=int64)
tf.Tensor([ 1  1  3  1  3 11 12  1 10], shape=(9,), dtype=int64)
tf.Tensor([ 1  1  3  1  3 11 12  1 10  0], shape=(10,), dtype=int64)
True

Does anyone know why?

ANSWER

Answered 2021-Sep-28 at 13:57

The bug is fixed by the PR at https://github.com/keras-team/keras/pull/15422.
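
Until that fix reaches the installed TensorFlow version, the question's own experiment suggests a workaround: pass output_mode as the literal string 'int' instead of forwarding it from the saved config (the new_v2 pattern). A sketch of wrapping that into a loader, assuming the pickled {'config', 'weights'} layout used above:

import pickle
import tensorflow as tf

def load_text_vectorization(pkl_path):
    # Rebuild a TextVectorization layer from the pickled config and weights.
    from_disk = pickle.load(open(pkl_path, "rb"))
    cfg = from_disk['config']
    layer = tf.keras.layers.TextVectorization(
        max_tokens=cfg['max_tokens'],
        standardize=cfg['standardize'],
        split=cfg['split'],
        output_mode='int',  # literal string sidesteps the truncated-length issue (cf. new_v2 above)
        output_sequence_length=cfg['output_sequence_length'],
    )
    # Adapt on dummy data before restoring the vocabulary weights.
    layer.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
    layer.set_weights(from_disk['weights'])
    return layer

new_vectorizer = load_text_vectorization("./models/tv_layer.pkl")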

Source https://stackoverflow.com/questions/69211649

Community Discussions contain sources that include Stack Exchange Network

Tutorials and Learning Resources in Natural Language Processing

Tutorials and Learning Resources are not available at this moment for Natural Language Processing
