Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The result is a computer capable of 'understanding' the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
Popular New Releases in Natural Language Processing
transformers
v4.18.0: Checkpoint sharding, vision models
HanLP
v1.8.2: Routine maintenance and accuracy improvements
spaCy
v3.1.6: Workaround for Click/Typer issues
flair
Release 0.11
allennlp
v2.9.2
Popular Libraries in Natural Language Processing
by huggingface python
61400 Apache-2.0
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
by fighting41love python
33333
A massive curated collection of Chinese NLP resources. It includes lexicons and word lists (sensitive words, stop words, synonyms/antonyms/negations, abbreviations, character decomposition, word sentiment values, Chinese/Japanese name lists, and domain vocabularies for cars, finance, law, medicine, food, place names, idioms, poetry, IT, and more); text utilities (language detection, phone/ID-card/email extraction, carrier and region lookup, gender inference from names, simplified/traditional conversion, pinyin and Chinese/Arabic numeral conversion, text correction, and regular-expression tutorials); corpora and datasets (chat, rumor, Baidu QA, medical dialogue, People's Daily, couplets, poetry, handwriting/OCR data, Common Voice and other ASR corpora, CLUE benchmarks); pre-trained models and resources (Chinese whole-word-masking BERT, OpenCLaP, UER, ALBERT, ELECTRA, Chinese GPT-2 training code, XLM, various word embeddings, BERT-based NER and extractive summarization); knowledge graphs and QA systems (XLORE, medical, military, financial and judicial knowledge graphs, relation and event-triple extraction, entity linking); and tools for tokenization (jieba_fast), annotation (brat, doccano, Poplar), data augmentation (Chinese and English EDA), keyphrase extraction (pke), text visualization (Scattertext), and dialogue systems (rasa-based, ConvLab, GPT2-chitchat), along with many other datasets, toolkits, papers, and tutorials for Chinese and multilingual NLP.
by google-research python
28940 Apache-2.0
TensorFlow code and pre-trained models for BERT
by fxsjy python
26924 MIT
Jieba Chinese word segmentation
by geekcomputers python
23653 MIT
My Python Examples
by hankcs python
23581 Apache-2.0
Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, semantic dependency parsing, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion: natural language processing
by explosion python
23063 MIT
💫 Industrial-strength Natural Language Processing (NLP) in Python
by facebookresearch html
22903 MIT
Library for fast text representation and classification.
by sebastianruder python
18988 MIT
Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.
Trending New libraries in Natural Language Processing
by PaddlePaddle python
3119 Apache-2.0
Easy-to-use and fast NLP library with an awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications.
by DA-southampton python
2488
A structured summary of the knowledge an NLP engineer needs to accumulate, including interview questions, fundamentals, and engineering skills, to strengthen core competitiveness.
by jbesomi python
2212 MIT
Text preprocessing, representation and visualization from zero to hero.
by EleutherAI python
2012 Apache-2.0
An implementation of model parallel autoregressive transformers on GPUs, based on the DeepSpeed library.
by CLUEbenchmark python
1760
Search all Chinese NLP datasets, with commonly used English NLP datasets included
by ivan-bilan python
1418 CC0-1.0
A comprehensive reference for all topics related to Natural Language Processing
by allenai python
1207 Apache-2.0
Longformer: The Long-Document Transformer
by yuchenlin python
1200 MIT
A simple tool to update bib entries with their official information (e.g., DBLP or the ACL anthology).
by nlpodyssey go
1116 BSD-2-Clause
Self-contained Machine Learning and Natural Language Processing library in Go
Top Authors in Natural Language Processing
1: 43 Libraries (21327)
2: 40 Libraries (1095)
3: 40 Libraries (13881)
4: 37 Libraries (203)
5: 37 Libraries (6153)
6: 35 Libraries (9769)
7: 31 Libraries (41620)
8: 28 Libraries (8695)
9: 24 Libraries (739)
10: 22 Libraries (1302)
Trending Kits in Natural Language Processing
One of the most popular systems for visualizing numerical data in pandas is the boxplot, which can be created by calculating the quartiles of a data set. Box plots are among the most commonly used types of graphs in business, statistics, and data analysis.
One way to plot a boxplot from a pandas DataFrame is to use the boxplot() function that is part of the pandas library. A boxplot is also used to discover outliers in a data set. pandas is a Python library built to streamline processes around acquiring and manipulating relational data, and it has built-in methods for plotting and visualizing the values captured in its data structures. The plot() function is used to draw points in a diagram; by default it draws a line from point to point, and it takes parameters that specify the points to draw.
Box plots are mostly used to show distributions of numeric data values, especially when you want to compare them between multiple groups. These plots are also broadly used for comparing two data sets.
Here is an example of how we can create a boxplot of Grouped column
Preview of the output that you will get on running this code from your IDE
Code
In this solution, we use the boxplot function from pandas.
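The snippet itself is not reproduced on this page, so here is a minimal sketch of the approach, assuming a DataFrame like the one shown in the Note below (the columns Group, M and F are illustrative):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Example DataFrame: one grouping column and two numeric columns
df = pd.DataFrame({'Group': [1, 1, 1, 2, 3, 2, 2, 3, 1, 3],
                   'M': np.random.rand(10),
                   'F': np.random.rand(10)})
df = df[['Group', 'M', 'F']]

# Draw one box of M and F for every value of Group
df.boxplot(column=['M', 'F'], by='Group')
plt.show()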
- Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
- Create your own DataFrame that needs to be box-plotted
- Add the NumPy library
- Run the file to get the Output
- Add plt.show() at the end of the code to Display the output
I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.
I found this code snippet by searching for "Plotting boxplots for a groupby object" in kandi. You can try any such use case!
Note
- In line 3, make sure the import statement starts with a lowercase i
- Create your own DataFrame, for example:
df = pd.DataFrame({'Group':[1,1,1,2,3,2,2,3,1,3],'M':np.random.rand(10),'F':np.random.rand(10)})
df = df[['Group','M','F']]
Environment Tested
I tested this solution in the following versions. Be mindful of changes when working with other versions.
- The solution is created in Python 3.7.15 Version
- The solution is tested on numPy 1.21.6 Version
- The solution is tested on matplotlib 3.5.3 Version
- The solution is tested on Seaborn 0.12.2 Version
Using this solution, we are able to create a boxplot of a grouped column in Python with the help of the pandas library. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us create boxplots in Python.
Dependent Library
If you do not have pandas, matplotlib, seaborn, and NumPy, which are required to run this code, you can install them by clicking on the above link and copying the pip install command from the corresponding library page in kandi. You can search for any dependent library on kandi, like NumPy, pandas, matplotlib, and seaborn.
Support
- For any support on kandi solution kits, please use the chat
- For further learning resources, visit the Open Weaver Community learning page.
The popular Python package spaCy is used for natural language processing. Removing named entities from a text, such as people's names, is one of the things you can do with spaCy. This can be done using the ents property of a spaCy document, which returns the named entities that have been identified in the text.
Removing names using Spacy can have several applications, including:
- Anonymizing text data: Removing names from text data can be useful for protecting the privacy of individuals mentioned in the data.
- Text summarization: Removing named entities, such as names of people, organizations, and locations, can help to reduce the amount of irrelevant information in a text and improve the overall readability of the summary.
- Text classification: Removing named entities can help to improve the performance of text classification models by reducing the amount of noise in the text and making it easier for the model to identify the relevant features.
- Sentiment Analysis: Removing names can help to improve the accuracy of sentiment analysis by reducing the amount of personal bias and opinion present in the text.
Here is how you can remove names using Spacy:
Preview of the output that you will get on running this code from your IDE
Code
In this solution we used spacy library of python.
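The exact snippet is not shown here, so below is a hedged sketch of the idea: load a pretrained pipeline and blank out every PERSON entity. The example sentence is made up, and en_core_web_sm is used for simplicity even though the Note below mentions en_core_web_trf.

import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "Alice met Bob Smith in Paris last Tuesday."
doc = nlp(text)

# Cut out the character spans of all PERSON entities, working backwards
# so that earlier offsets remain valid while we shorten the string
cleaned = text
for ent in reversed(doc.ents):
    if ent.label_ == "PERSON":
        cleaned = cleaned[:ent.start_char] + cleaned[ent.end_char:]

print(" ".join(cleaned.split()))  # tidy up the leftover whitespace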
- Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
- Enter the Text
- Run the file to remove the names from the text
I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.
I found this code snippet by searching for "Using Spacy to remove names from a data frame "in kandi. You can try any such use case.
Note
In this snippet we are using a language model (en_core_web_trf)
- Download the model using the command python -m spacy download en_core_web_trf
- Paste it in your terminal and download it
Environment Tested
I tested this solution in the following versions. Be mindful of changes when working with other versions.
- The solution is created in Python 3.7.15 Version
- The solution is tested on Spacy 3.4.3 Version
- The solution is tested on Pandas 1.3.5 Version
Using this solution, we can remove names in text with the help of a function in spaCy. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us remove names from text in Python.
Dependent Library
If you do not have SpaCy and pandas, which are required to run this code, you can install them by clicking on the above link and copying the pip install command from the Spacy page in kandi.
You can search for any dependent library on kandi like Spacy and pandas
Support
- For any support on kandi solution kits, please use the chat
- For further learning resources, visit the Open Weaver Community learning page
We will locate a specific group of words in a text using the SpaCy library, then replace those words with an empty string to remove them from the text.
Using SpaCy, it is possible to exclude words within a specific span from a text in the following ways:
- Text pre-processing: Removing specific words or phrases from text can be a useful step in pre-processing text data for NLP tasks such as text classification, sentiment analysis, and language translation.
- Document summarization: Maintaining only the most crucial information, specific words or phrases will serve to construct a summary of a lengthy text.
- Data cleaning: Anonymization and data cleaning can both benefit from removing sensitive or useless text information, such as names and addresses.
- Text generation: Removing specific words or phrases can help steer the context or meaning of newly generated text.
- Text augmentation: Text can be used for text augmentation techniques in NLP by removing specific words or phrases and replacing them with new text variations.
Here is how you can remove words in span using SpaCy:
Preview of the output that you will get on running this code from your IDE
Code
In this solution we have used spacy library of python
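As a minimal sketch of this technique (the phrase to remove and the sample sentence are illustrative), you can locate the span with a PhraseMatcher and rebuild the text without its tokens:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

# Find the span of words we want to drop
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("TO_REMOVE", [nlp.make_doc("quick brown fox")])
matches = matcher(doc)

# Collect the indices of every token inside a matched span
skip = {i for _, start, end in matches for i in range(start, end)}

# Rebuild the text, skipping the matched tokens
cleaned = "".join(tok.text_with_ws for tok in doc if tok.i not in skip)
print(cleaned)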
- Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
- Enter the Text
- Run the code to remove the specific words from the text
I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.
I found this code snippet by searching for "Remove words in span from spacy" in kandi. You can try any such use case!
Note
In this snippet we are using a Language model (en_core_web_sm)
- Download the model using the command python -m spacy download en_core_web_sm
- Paste it in your terminal and download it.
Check your spaCy version using the pip show spacy command in your terminal.
- If the version is 3.0 or later, load the model using nlp = spacy.load("en_core_web_sm")
- If the version is earlier than 3.0, load it using nlp = spacy.load("en")
Environment Tested
I tested this solution in the following versions. Be mindful of changes when working with other versions.
- The solution is created in Python 3.7.15 Version
- The solution is tested on Spacy 3.4.3 Version
Using this solution, we can remove the words in a given span with the help of a function in spaCy. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us remove specific words from a sentence in Python.
Dependent Library
If you do not have SpaCy and numpy, which are required to run this code, you can install them by clicking on the above link and copying the pip install command from the Spacy page in kandi.
You can search for any dependent library on kandi like SpaCy and numpy
Support
- For any support on kandi solution kits, please use the chat
- For further learning resources, visit the Open Weaver Community learning page
Here are some of the famous C++ Natural Language Libraries. Some of the use cases of C++ Natural Language Libraries include Text Processing, Speech Recognition, Machine Translation, and Natural Language Understanding.
C++ natural language libraries are software libraries written in the C++ programming language that are used to process natural language, such as English, and extract meaning from text. These libraries are often used for natural language processing (NLP) applications, like text classification, sentiment analysis, and machine translation.
Let us have a look at some of the famous C++ Natural Language Libraries in detail below.
MITIE
- Designed to be highly scalable, allowing it to process large amounts of text quickly and efficiently.
- Uses a combination of statistical and machine learning techniques to identify relationships between words, phrases, and sentences.
- Written in C++, making it easy to integrate with existing applications and systems.
Gate
- Offers multi-platform support for Windows, Mac, and Linux.
- Allows developers to annotate text with semantic information, enabling more powerful natural language processing applications.
- Written in Java, making it easily accessible to developers with existing Java skills.
spacy-cpp
- One of the fastest C++ natural language libraries, offering up to 30x faster performance than similar libraries.
- Includes features like tokenization, part-of-speech tagging, dependency parsing, and rule-based matching.
- Designed to scale well for large datasets, making it a good choice for enterprise-level applications.
snowball
- Offers a wide range of functions for stemming, lemmatization, and other natural language processing tasks.
- Able to handle most Unicode characters and works across different platforms.
- Offers powerful stemmers for multiple languages, including English, Spanish, French, German, Portuguese, and Italian.
aiml
- Flexible platform that supports a wide range of use cases.
- Designed to represent natural language.
- Powerful library that can be used to create complex conversations and interactions with users.
polyglot
- Designed for scalability, allowing developers to deploy applications on a distributed computing cluster.
- Offers a range of tools that make it easier to develop and deploy natural language processing applications.
- Designed to be highly portable, allowing developers to write code that can run on any platform and operating system.
NLTK
- Open-source, so it is available to anyone and can be modified to fit specific needs.
- Written in Python rather than C++, making it more accessible and easier to use than many C++ natural language libraries.
- Has a graphical user interface, which makes it easy to explore the data and develop models.
wordnet
- Organized into semantic categories and hierarchical structures, allowing users to quickly find related words and their definitions.
- Provides access to synonyms and antonyms, making it unique from other C++ natural language libraries.
- Provides access to a corpus of example sentences and usage notes.
Tokenization is the division of a text string into discrete tokens. spaCy offers the option to personalize tokenization by building a custom tokenizer.
There are several uses for customizing tokens in SpaCy, some of which include:
- Handling special input forms: A custom tokenizer can be used to handle specific input formats, such as those seen in emails or tweets, and tokenize the text in accordance.
- Enhancing model performance: Custom tokenization can help your model perform better by giving it access to more pertinent and instructive tokens.
- Managing non-standard text: Some text inputs may contain non-standard words or characters, which require special handling.
- Handling multi-language inputs: A custom tokenizer can be used to handle text inputs in multiple languages by using language-specific tokenization methods.
- Using customized tokenization in a particular field: Text can be tokenized appropriately by using customized tokenization in a particular field, such as the legal, medical, or scientific fields.
Here is how you can customize tokens in SpaCy:
Preview of the output that you will get on running this code from your IDE
Code
In this solution we have used matcher function of Spacy library.
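The kit's own snippet reportedly uses the Matcher; as an alternative minimal sketch of customizing the tokenization itself, spaCy also lets you add a special-case rule to the tokenizer (the word "gimme" below is only an illustration):

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")

# Special-case rule: always split "gimme" into the two tokens "gim" + "me"
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

doc = nlp("gimme that book, please")
print([token.text for token in doc])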
- Copy the code using the "Copy" button above, and paste it in a Python file in your IDE
- Enter the text that needs to be tokenized
- Run the program to tokenize the given text
I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.
I found this code snippet by searching "Customize Tokens using spacy" in Kandi. You can try any use case
Environment Tested
I tested this solution in the following version. Be mindful of changes when working with other versions
- This solution is created and executed in Python 3.7.15 version
- This solution is tested in Spacy on 3.4.3 version
Using this solution, we can tokenize the text, which means it will break the text down into the analytical units needed for further processing. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us break up text in Python.
Dependent Libraries
If you do not have SpaCy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Spacy page in kandi. You can search for any dependent library on kandi like Spacy
Support
- For any support on kandi solution kits, please use the chat
- For further learning resources, visit the Open Weaver Community learning page.
Using SpaCy, you may utilize the techniques below to identify the full sentence that includes a particular keyword:
- Load the desired language model and import the SpaCy library.
- By feeding the text data via the SpaCy nlp object, you can process it.
- Iterate over the sentences in the processed text, checking whether each one contains the keyword.
Finding entire sentences that contain a particular term has a variety of applications, including:
- Text mining: This technique can be used to extract pertinent facts from massive amounts of text data by looking for sentences that include a particular keyword.
- Information retrieval: Users can quickly locate pertinent information in a document or group of documents by searching for sentences that contain a particular keyword.
- Question-answering: Finding sentences that answer a question can help question-answering systems be more accurate.
- Text summarization: Finding sentences with essential words in them can aid in creating a summary of a text that accurately conveys its primary concepts.
- Evaluation of the language model: The ability of the language model to produce writing that is human-like can be assessed by locating full sentences that contain a keyword.
Here is how you can find the complete sentence that contains your keyword:
Preview of the output that you will get on running this code from your IDE
Code
In this solution we have used Matcher function of SpaCy Library.
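Here is a hedged sketch of that idea; the keyword and the sample text are made up, and the sentence boundaries come from the en_core_web_sm pipeline mentioned in the Note below:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
text = ("spaCy is an NLP library. It ships with pretrained pipelines. "
        "Many teams use spaCy in production.")
keyword = "spaCy"

doc = nlp(text)
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("KEYWORD", [nlp.make_doc(keyword)])

# Print the full sentence that contains each match
for _, start, end in matcher(doc):
    print(doc[start:end].sent.text)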
- Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
- Enter the Text
- Run the code to find the complete sentence you are looking for.
I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.
I found this code snippet by searching for "How to extract sentence with key phrases in SpaCy" in kandi. You can try any such use case!
Note
In this snippet we are using a Language model (en_core_web_sm)
- Download the model using the command python -m spacy download en_core_web_sm
- Paste it in your terminal and download it.
Check your spaCy version using the pip show spacy command in your terminal.
- If the version is 3.0 or later, load the model using nlp = spacy.load("en_core_web_sm")
- If the version is earlier than 3.0, load it using nlp = spacy.load("en")
Environment Tested
I tested this solution in the following versions. Be mindful of changes when working with other versions.
- The solution is created in Python 3.7.15 Version
- The solution is tested on Spacy 3.4.3 Version
Using this solution, we can collect the complete sentence that the user needs with the help of a function in spaCy. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us collect the sentences or keywords the user needs in Python.
Dependent Library
If you do not have SpaCy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Spacy page in kandi.
You can search for any dependent library on kandi like SpaCy
Support
- For any support on kandi solution kits, please use the chat
- For further learning resources, visit the Open Weaver Community learning page
In the spaCy library, a token refers to a single word or punctuation mark that is part of a larger document. Tokens have various attributes, such as the text of the token, its part-of-speech tag, and its dependency label, that can be used to extract information from the text and understand its meaning.
In spaCy, tokens can be merged into a single token with the retokenizer: open the Doc.retokenize() context manager and call retokenizer.merge() on the span of tokens you want to combine. (The older Doc.merge() method was removed in spaCy 3.)
- Retokenizer.merge(): Combines multiple individual tokens into a single token.
Merging spaCy tokens in a Doc lets you group several individual tokens into one, which can be useful for various natural language processing tasks.
You may have a look at the code below for more information about merging SpaCy tokens into a doc.
Preview of the output that you will get on running this code from your IDE
Code
In this solution, we have used the Retokenizer.merge method from spaCy.
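As a minimal sketch of the retokenizer (the sentence and the merged span are illustrative):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("New York City is in the United States.")
print("Before:", [t.text for t in doc])

# Merge the first three tokens ("New York City") into a single token
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:3])

print("After:", [t.text for t in doc])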
- Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
- Enter the Text
- Run the file to Merge tokens in doc
I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.
I found this code snippet by searching for "Merge sapcy tokens into a Doc " in kandi. You can try any such use case!
Environment Tested
I tested this solution in the following versions. Be mindful of changes when working with other versions.
- The solution is created in Python 3.7.15 Version
- The solution is tested on Spacy 3.4.3 Version
Using this solution, we can merge tokens into a doc with the help of a function in spaCy. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us merge tokens in Python.
Dependent Library
If you do not have SpaCy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Spacy page in kandi.
You can search for any dependent library on kandi like Spacy
Support
- For any support on kandi solution kits, please use the chat
- For further learning resources, visit the Open Weaver Community learning page
Hyphenated words have a hyphen (-) between two or more parts of the word. Hyphens are often used to join commonly used words into a single compound.
Tokenization is breaking down a piece of text into smaller units called tokens. Tokens are the basic building blocks of a text, and they can be words, phrases, sentences, or even individual characters, depending on the task and the granularity level required. The tokenization of hyphenated words can be tricky, as the hyphen can indicate different things depending on the context and the language. There are various ways to handle hyphenated words during tokenization, and the best method will depend on the specific task and the desired level of granularity.
- Treat the entire word as a single token: It treats the entire word, including the hyphen, as a single token.
- Treat the word as two separate tokens: This method splits the word into two separate tokens, one for each part of the word.
- Treat the hyphen as a separate token: This method treats the hyphen as a separate token.
You may have a look at the code below for more information about Tokenization of hyphenated words.
Preview of the output that you will get on running this code from your IDE
Code
In this solution we have used Tokenizer function of NLTK.
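The snippet is not reproduced here; the sketch below uses NLTK (as stated above) to illustrate the three strategies from the list on a made-up sentence. Note that newer NLTK releases may also ask you to download the punkt_tab resource.

import nltk
from nltk.tokenize import word_tokenize, WordPunctTokenizer, RegexpTokenizer

nltk.download("punkt", quiet=True)  # models used by word_tokenize

text = "The state-of-the-art model uses a well-known technique."

# 1) Treebank-style tokenization usually keeps hyphenated words as one token
print(word_tokenize(text))

# 2) Split on punctuation, so each hyphen becomes its own token
print(WordPunctTokenizer().tokenize(text))

# 3) Keep only word characters, dropping the hyphens entirely
print(RegexpTokenizer(r"\w+").tokenize(text))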
- Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
- Enter the Text
- Run the file to Tokenize the Hyphenated words
I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.
I found this code snippet by searching for "Tokenization of Hyphenated Words" in kandi. You can try any such use case!
Note
In this snippet we are using a Language model (en_core_web_sm)
- Download the model using the command python -m spacy download en_core_web_sm
- Paste it in your terminal and download it.
Check your spaCy version using the pip show spacy command in your terminal.
- If the version is 3.0 or later, load the model using nlp = spacy.load("en_core_web_sm")
- If the version is earlier than 3.0, load it using nlp = spacy.load("en")
Environment Tested
I tested this solution in the following versions. Be mindful of changes when working with other versions.
- The solution is created in Python 3.7.15 version.
- The solution is tested on Spacy 3.4.3 version.
Using this solution, we are able to tokenize hyphenated words in Python with simple steps. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us tokenize words in Python.
Dependent Library
If you do not have SpaCy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Spacy page in kandi.
You can search for any dependent library on kandi like Spacy
Support
- For any support on kandi solution kits, please use the chat
- For further learning resources, visit the Open Weaver Community learning page
To remove names from noun chunks, you can use the spaCy library in Python. First load the library and a pre-trained model, then use the doc.noun_chunks attribute to extract all of the noun chunks in a given text. You can then loop over each chunk and use an if statement to check whether the chunk contains a proper noun; if it does, you can remove that name from the text.
The removal of names from noun chunks has a variety of uses, such as:
- Data anonymization: To respect people's privacy and adhere to data protection laws, text can be made anonymous by removing personal identifiers.
- Text summarization: By omitting proper names from the text, it is possible to condense the length of a summary while maintaining the important points.
- Text classification: By lowering the amount of noise in the input data, removing proper names from text improves text classification algorithms' performance.
- Sentiment analysis: By removing proper names, sentiment analysis can be made more objective.
- Text-to-Speech: By removing appropriate names from the discourse, text-to-speech can sound more natural.
Here is how you can remove names from noun chunks:
Preview of the output that you will get on running this code from your IDE
Code
In this solution we have used Spacy library of python.
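Here is a hedged sketch of the approach described above; the sentence is illustrative, and proper nouns are identified via the PROPN part-of-speech tag:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Alice Johnson presented the quarterly report to the board in London.")

for chunk in doc.noun_chunks:
    # Keep every token in the chunk that is not a proper noun
    kept = " ".join(token.text for token in chunk if token.pos_ != "PROPN")
    print(f"{chunk.text!r} -> {kept!r}")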
Instructions:
- Download and install VS Code on your desktop.
- Open VS Code and create a new file in the editor.
- Copy the code snippet that you want to run, using the "Copy" button or by selecting the text and using the copy command (Ctrl+C on Windows/Linux or Cmd+C on Mac).
- Paste the code into your file in VS Code, and save the file with a meaningful name.
- Open a terminal window or command prompt on your computer.
- To install spaCy, use the command pip install spacy (version 3.4.3 was used here)
- Once spacy is installed, you can download the en_core_web_sm model using the following command: python -m spacy download en_core_web_sm Alternatively, you can install the model directly using pip: pip install en_core_web_sm
- To run the code, open the file in VS Code and click the "Run" button in the top menu, or use the keyboard shortcut Ctrl+Alt+N (on Windows and Linux) or Cmd+Alt+N (on Mac). The output of your code will appear in the VS Code output console.
I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.
I found this code snippet by searching for "Remove Name from noun chuncks using SpaCy" in kandi. You can try any such use case!
Environment Tested
I tested this solution in the following versions. Be mindful of changes when working with other versions.
- The solution is created in Python 3.7.15 Version
- The solution is tested on Spacy 3.4.3 Version
Using this solution, we can remove names from noun chunks with the help of a function in spaCy. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us extract the nouns in Python.
Dependent Library
If you do not have SpaCy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Spacy page in kandi.
You can search for any dependent library on kandi like SpaCy
Support
- For any support on kandi solution kits, please use the chat
- For further learning resources, visit the Open Weaver Community learning page
An attribute error occurs in Python when a code tries to access an attribute (i.e., a variable or method) that does not exist in an object or class. For example, if you try to access an instance variable that has not been defined, you will get an attribute error.
When using spaCy, an attribute error can happen if you try to access a property or attribute of an object (such as a token or doc) that is not declared or doesn't exist. To fix this, use the “hasattr()” function to check whether the attribute is present before attempting to access it, or double-check the attribute name against the spaCy documentation.
- hasattr(): hasattr() is a built-in Python function that is used to check if an object has a given attribute. It takes two arguments: the object to check and the attribute’s name as a string. If the object has the attribute, hasattr() returns True. Otherwise, it returns False.
It is important to read the spaCy documentation to understand the properties and methods provided by spaCy for different objects.
You may have a look at the code below for more information about solving attribute errors using SpaCy.
Preview of the output that you will get on running this code from your IDE
Code
In this solution we have used Spacy library.
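As a minimal sketch of the guard described above (the sentence is made up), hasattr() is used before reading the lemma of each token:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The striped bats were hanging on their feet.")

for token in doc:
    # Guard against a missing or misspelled attribute before accessing it;
    # note that .lemma_ is the string form, while .lemma is an integer hash
    if hasattr(token, "lemma_"):
        print(token.text, "->", token.lemma_)
    else:
        print(token.text, "-> this object has no lemma_ attribute")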
- Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
- Run the program to lemmatize the text
I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.
I found this code snippet by searching for "How can i solve an attribute error when using SpaCy " in kandi. You can try any such use case!
Environment Tested
I tested this solution in the following versions. Be mindful of changes when working with other versions.
- The solution is created in Python 3.7.15 Version
- The solution is tested on Spacy 3.4.3 Version
Using this solution, we are going to lemmatize words with the help of the spaCy library. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us lemmatize words in Python.
Dependent Library
If you do not have SpaCy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Spacy page in kandi.
You can search for any dependent library on kandi like Spacy
Support
- For any support on kandi solution kits, please use the chat
- For further learning resources, visit the Open Weaver Community learning page.
Entities are specific pieces of information that can be extracted from a text. They can be categorized into different types: person, location, organization, event, product etc. These are some common entity types, but other entities may depend on the specific use case or domain.
Tagging entities in a string, also known as named-entity recognition (NER), is a way to extract structured information from unstructured text. Tagging entities involves identifying and classifying specific pieces of information, such as people, places, and organizations, and labeling them with specific tags or labels. There are several ways to tag entities in a string, some of which include:
- Regular expressions: This method uses pattern matching to identify entities in a string.
- Named Entity Recognition (NER): This method uses machine learning algorithms to identify entities in a string. It is commonly used in natural language processing tasks.
- Dictionary or lookup-based method: This method uses a pre-defined dictionary or lookup table to match entities in a string.
You may have a look at the code below for more information about tagging entities in string.
Preview of the output that you will get on running this code from your IDE
Code
In this solution we have used Spacy library.
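A minimal sketch of entity tagging with spaCy's pretrained NER (the sentence is illustrative):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new office in Bangalore, India, in 2025.")

# Each entity span exposes its text, character offsets and type label
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)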
- Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
- Enter the Text
- Run the file to Tag the entities in the string
I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.
I found this code snippet by searching for "Tag entities in the string using Spacy " in kandi. You can try any such use case!
Note
In this snippet we are using a Language model (en_core_web_sm)
- Download the model using the command python -m spacy download en_core_web_sm
- Paste it in your terminal and download it.
Check your spaCy version using the pip show spacy command in your terminal.
- If the version is 3.0 or later, load the model using nlp = spacy.load("en_core_web_sm")
- If the version is earlier than 3.0, load it using nlp = spacy.load("en")
Environment Tested
I tested this solution in the following versions. Be mindful of changes when working with other versions.
- The solution is created in Python 3.7.15 Version
- The solution is tested on Spacy 3.4.3 Version
Using this solution, we can tag entities in a string with the help of spaCy. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us tag entities in Python.
Dependent Library
If you do not have SpaCy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Spacy page in kandi.
You can search for any dependent library on kandi like Spacy
Support
- For any support on kandi solution kits, please use the chat
- For further learning resources, visit the Open Weaver Community learning page
The spaCy library provides the Doc object to represent a document, which can be tokenized into individual words or phrases (tokens) using the “doc.sents” and doc[i] attributes. You can convert a Doc object into a nested list of tokens by iterating through the sentences in the document, and then iterating through the tokens in each sentence.
- spaCy: spaCy is a library for advanced natural language processing in Python. It is designed specifically for production use, and it is fast and efficient. spaCy is widely used for natural language processing tasks such as named entity recognition, part-of-speech tagging, text classification, and others.
- Doc.sents: Doc.sents allows you to work with individual sentences easily and efficiently in a text, rather than having to manually split the text into sentences yourself. This can be useful in a variety of natural languages processing tasks, such as sentiment analysis or text summarization, where it's important to be able to work with individual sentences.
To learn more about the topic, you may have a look at the code below
Preview of the output that you will get on running this code from your IDE
Code
In this solution we used spaCy library of python.
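A minimal sketch of this conversion (the text is made up); the inner lists are built per sentence from doc.sents:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("spaCy makes NLP easy. It is fast. It is also well documented.")

# One inner list of token texts per sentence
nested = [[token.text for token in sent] for sent in doc.sents]
print(nested)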
- Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
- Import the spaCy library
- Run the file to turn spacy doc into nested List of tokens.
I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.
I found this code snippet by searching for "How to turn spacy doc into nested list of tokens"in kandi. You can try any such use case.
Environment Tested
I tested this solution in the following versions. Be mindful of changes when working with other versions.
- The solution is created in Python 3.7.15 Version
- The solution is tested on Spacy 3.4.3 Version
Using this solution, we can turn a spaCy doc into a nested list of tokens with the help of a function in spaCy. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us turn a doc into a nested list in Python.
Dependent Library
If you do not have SpaCy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Spacy page in kandi.
You can search for any dependent library on kandi like Spacy
Support
- For any support on kandi solution kits, please use the chat
- For further learning resources, visit the Open Weaver Community learning page
"Matching a pattern" refers to finding occurrences of a certain word pattern or other linguistic components inside a text. A regular expression, which is a string of characters that forms a search pattern, is frequently used for this.
In Python, you can use the “re” module to match patterns in strings using regular expressions. There are several functions and techniques for locating and modifying patterns in strings available in the “re” module.
- re.search(): This is used to search for a specific pattern in a text. It is a component of the Python regular expression (re) module, which offers a collection of tools for using regular expressions.
A document's token patterns may be matched using the “Matcher” class in spaCy.
- Matcher: It returns the spans in the document that match a set of token patterns that are input.
- spaCy: With the help of this well-known Python module for natural language processing, users may interact with text data in a rapid and easy manner. It contains ready-to-use pre-trained models for a variety of languages. It is often used for a range of NLP applications in both industry and academics.
You can have a look at the code below to match the pattern using SpaCy.
Preview of the output that you will get on running this code from your IDE
Code
In this solution we use the Matcher function of the SpaCy library.
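Here is a hedged sketch of the Matcher in spaCy 3.x (the pattern and text are illustrative); note that the pattern is passed inside a list and without the None argument, which is exactly what the Note below is about:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# "hello", optionally followed by punctuation, then "world"
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]
matcher.add("HELLO_WORLD", [pattern])  # spaCy 3.x: patterns go in a list

doc = nlp("Hello, world! Hello world!")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)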
- Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
- Enter the text that needs to be matched
- Run the file to find matches for the pattern.
I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.
I found this code snippet by searching for "Pattern in Spacy " in kandi. You can try any such use case!
Note
- In spaCy 3.x, the Matcher.add function takes only two arguments, so delete the None argument in line 11.
- The new version of spaCy needs the pattern wrapped in a list, so enclose the pattern in square brackets in line 11.
Environment Tested
I tested this solution in the following versions. Be mindful of changes when working with other versions.
- The solution is created in Python 3.7.15 Version
- The solution is tested on Spacy 3.4.3 Version
Using this solution, we are able to find matches for our pattern in Python with the help of the spaCy library. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us find matches for our pattern in Python.
Dependent Library
If you do not have SpaCy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Spacy page in kandi.
You can search for any dependent library on kandi like Spacy
Support
- For any support on kandi solution kits, please use the chat
- For further learning resources, visit the Open Weaver Community learning page.
In spaCy, the Matcher class allows you to match patterns of tokens in a Doc object. The Matcher is initialized with a vocabulary, and then patterns can be added to the matcher using the “Matcher.add()” method. The patterns that are added to the matcher are defined using a list of dictionaries, where each dictionary represents a token and its attributes.
- Matcher.add(): In spaCy, the Matcher.add() method is used to add patterns to a Matcher object, which can then be used to find matches in a Doc object.
Once the patterns have been added to the matcher, you can call the matcher on a Doc object, as in matcher(doc), to find all instances of the specified patterns.
- Calling the matcher returns a list of tuples, where each tuple represents a match and contains the match ID together with the start and end token indices of the matching span in the Doc. This can be useful in various NLP tasks such as information extraction, text summarization, and sentiment analysis.
You may have a look at the code below for more information about SpaCy matcher patterns with specific nouns.
Preview of the output that you will get on running this code from your IDE
Code
In this solution we have used Matcher function of Spacy.
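As a hedged sketch of a Matcher pattern restricted to specific nouns (the regex and sentence are illustrative), a single-token pattern can combine a POS constraint with a REGEX constraint on the text:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Nouns whose surface text ends with s, l or t
pattern = [{"POS": "NOUN", "TEXT": {"REGEX": r"[slt]$"}}]
matcher.add("NOUN_SLT", [pattern])

doc = nlp("The cats sat on the carpet near the wall and the mat.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)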
- Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
- Enter the Text
- Run the file to get the nouns that end with s, l, or t.
I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.
I found this code snippet by searching for "Spacy matcher pattern with specific Nouns" in kandi. You can try any such use case!
Environment Tested
I tested this solution in the following versions. Be mindful of changes when working with other versions.
- The solution is created in Python 3.7.15 Version
- The solution is tested on Spacy 3.4.3 Version
Using this solution, we can collect nouns that end with s, t, or l with the help of a function in spaCy. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us collect whatever nouns the user needs in Python.
Dependent Library
If you do not have SpaCy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Spacy page in kandi.
You can search for any dependent library on kandi like Spacy
Support
- For any support on kandi solution kits, please use the chat
- For further learning resources, visit the Open Weaver Community learning page
In SpaCy, you can use the part-of-speech (POS) tagging functionality to classify words as nouns. POS tagging is the process of marking each word in a text with its corresponding POS tag.
Numerous uses for noun classification using SpaCy include:
- Information Extraction: You can extract important details about the subjects discussed in a document, such as individuals, organizations, places, etc., by identifying the nouns in the text.
- Text summarization: You can extract the key subjects or entities discussed in a text and use them to summarize the text by selecting important nouns in the text.
- Text classification: You can categorize a text into different categories or themes by determining the most prevalent nouns in the text.
- Text generation: You can create new material that is coherent and semantically equivalent to the original text by identifying the nouns in a text and the relationships between them.
- Named Entity Recognition (NER): SpaCy provides built-in support for NER, which can be used to extract entities from text with high accuracy.
- Query Expansion
- Language Translation
Here is how you can perform noun classification using SpaCy:
Preview of the output that you will get on running this code from your IDE
Code
In this solution we have used the Spacy library.
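Here is a minimal sketch of noun classification with POS tags (the sentence is illustrative); tokens tagged NOUN or PROPN are collected:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The committee approved the budget for the new research center in Berlin.")

# Keep common nouns (NOUN) and proper nouns (PROPN)
nouns = [token.text for token in doc if token.pos_ in ("NOUN", "PROPN")]
print(nouns)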
Instructions:
- Download and install VS Code on your desktop.
- Open VS Code and create a new file in the editor.
- Copy the code snippet that you want to run, using the "Copy" button or by selecting the text and using the copy command (Ctrl+C on Windows/Linux or Cmd+C on Mac).
- Paste the code into your file in VS Code, and save the file with a meaningful name.
- Open a terminal window or command prompt on your computer.
- To install spaCy, use the command pip install spacy (version 3.4.3 was used here)
- Once spacy is installed, you can download the en_core_web_sm model using the following command: python -m spacy download en_core_web_sm Alternatively, you can install the model directly using pip: pip install en_core_web_sm
- To run the code, open the file in VS Code and click the "Run" button in the top menu, or use the keyboard shortcut Ctrl+Alt+N (on Windows and Linux) or Cmd+Alt+N (on Mac). The output of your code will appear in the VS Code output console.
I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.
I found this code snippet by searching for "Noun Classification using Spacy " in kandi. You can try any such use case!
Environment Tested
I tested this solution in the following versions. Be mindful of changes when working with other versions.
- The solution is created in Python 3.7.15 version
- The solution is tested on Spacy 3.4.3 version
Using this solution, we are able to collect the nouns in a text separately with the help of the spaCy library in Python. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us collect nouns using Python.
Dependent Library
If you do not have SpaCy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Spacy page in kandi. You can search for any dependent library on kandi like Spacy
Support
- For any support on kandi solution kits, please use the chat
- For further learning resources, visit the Open Weaver Community learning page.
The DependencyMatcher, a powerful tool offered by the spaCy library, can be used to match particular phrases based on the dependency parse of a sentence. Rather than matching word sequences by their surface forms alone, the DependencyMatcher lets you match tokens by their syntactic relations.
The SpaCy Dependency Matcher can be used in a variety of ways to match particular phrases based on their dependencies, such as:
- Text categorization: You can extract particular phrases from text using the Dependency Matcher.
- Information extraction: The Dependency Matcher can be used to extract specific data from language, including attributes, costs, and features of goods.
- Question answering: The Dependency Matcher can be used to identify the subject, verb, and object in a sentence to improve the accuracy of question answering systems.
- Text generation: By matching particular phrases based on their dependencies, the Dependency Matcher may produce text that is grammatically accurate and semantically relevant.
- Text summarization: The Dependency Matcher can be used to identify key phrases that capture the text's essential concepts and serve as a summary.
Here is how you can match a regex phrase with the spaCy DependencyMatcher in Python:
Preview of the output that you will get on running this code from your IDE
Code
In this solution we have used Pandas Library.
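The original snippet (which reportedly also uses pandas) is not shown here; as a hedged sketch of the DependencyMatcher itself, the pattern below anchors on a verb whose lemma matches a regex and requires a direct object attached to it. The lemmas, labels and sentence are illustrative.

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

pattern = [
    # Anchor token: a verb whose lemma matches the regex
    {"RIGHT_ID": "verb",
     "RIGHT_ATTRS": {"POS": "VERB", "LEMMA": {"REGEX": "^(found|establish|launch)$"}}},
    # A direct object that is a syntactic child of that verb
    {"LEFT_ID": "verb", "REL_OP": ">",
     "RIGHT_ID": "object",
     "RIGHT_ATTRS": {"DEP": "dobj"}},
]
matcher.add("VERB_OBJ", [pattern])

doc = nlp("Smith founded a healthcare company in 2005.")
for match_id, token_ids in matcher(doc):
    print([doc[i].text for i in token_ids])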
- Copy the code using the "Copy" button above, and paste it in a Python file in your IDE.
- Enter the Text
- Run the code to get the Output
I hope you found this useful. I have added the link to dependent libraries, version information in the following sections.
I found this code snippet by searching for "Spacy Regex Phrase using Dependency matcher" in kandi. You can try any such use case!
Environment Tested
I tested this solution in the following versions. Be mindful of changes when working with other versions.
- The solution is created in Python 3.7.15 Version
- The solution is tested on Spacy 3.4.3 Version
Using this solution, we can collect the sentences the user needs with the help of a function in spaCy. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us extract sentences in Python.
Dependent Library
If you do not have SpaCy that is required to run this code, you can install it by clicking on the above link and copying the pip Install command from the Spacy page in kandi.
You can search for any dependent library on kandi like SpaCy
Support
- For any support on kandi solution kits, please use the chat
- For further learning resources, visit the Open Weaver Community learning page
NLP helps build systems that can automatically analyze, process, summarize, and extract meaning from natural language text. In a nutshell, NLP helps machines mimic human behaviour and allows us to build applications that can reason about different types of documents. NLP open-source libraries are tools that allow you to build your own NLP applications. These libraries can be used to develop many different types of applications, like speech recognition, chatbots, sentiment analysis, email spam filtering, language translation, search engines, and question answering systems. NLTK is one of the most popular NLP libraries in Python. It provides easy-to-use interfaces to corpora and lexical resources such as WordNet, along with statistical models for common tasks such as part-of-speech tagging and noun phrase extraction. The following list includes libraries for basic sentiment analysis, such as the VADER (Valence Aware Dictionary and sEntiment Reasoner) tool, as well as collections of NLP resources: blogs, books, tutorials, and more. Check out the list of free, open source libraries to help you with your projects:
Some Popular Open Source Libraries to get you started
Utilize the below libraries to tokenize, implement part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition.
Sentiment Analysis Repository
Some interesting courses to Deep Dive
1. List of Popular Courses on NLP
2. Stanford Course on Natural Language Processing
3. André Ribeiro Miranda - NLP Course
Recording from Session on Build AI fake News Detector
Watch recording of a live training session on AI Fake News Detection
Example project on AI Virtual Agent that you can build in 30 mins
Here's a project with the installer, source code, and step-by-step tutorial that you can build in under 30 mins.
⬇️ Get the 1-Click install AI Virtual Agent kit
Watch the recording of a live training session on AI Virtual Agent
Natural Language Processing (NLP) is a broad subject that falls under the Artificial Intelligence (AI) domain. NLP allows computers to interpret text and spoken language in the same way that people do. NLP must be able to grasp not only words, but also phrases and paragraphs in their context based on syntax, grammar, and other factors. NLP algorithms break down human speech into machine-understandable fragments that can be utilized to create NLP-based software.
NLTK
The Natural Language Toolkit (NLTK) is one of the most frequently used libraries in the industry for building Python applications that interact with human language data. NLTK can assist you with anything from splitting paragraphs into sentences to recognizing the part of speech of specific phrases to highlighting the main topic. It is a very useful tool for preparing text for later analysis, for example before feeding it to a model: it helps translate words into numbers that the model can then work with. This collection contains nearly all of the tools required for NLP, and it helps with text classification, tokenization, parsing, part-of-speech tagging, and stemming.
spaCy
spaCy is a Python library built for sophisticated Natural Language Processing. It is based on cutting-edge research and was designed from the start to be used in real-world products. spaCy offers pre-trained pipelines and currently supports tokenization and training for more than 60 languages. It includes state-of-the-art speed and neural network models for tagging, parsing, named entity recognition, text classification, and other tasks, as well as a production-ready training system and simple model packaging, deployment, and workflow management.
Gensim
Gensim is a well-known Python package for natural language processing tasks. Its distinctive feature is the use of vector space modeling and topic modeling tools to determine the semantic similarity between two documents.
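As a quick illustration of the NLTK workflow described above, here is a minimal sketch; it assumes NLTK is installed and downloads the tokenizer and tagger data it needs.

import nltk

# One-time downloads of the tokenizer and tagger data (no-ops if already present).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "NLTK splits sentences from paragraphs and tags each word."
tokens = nltk.word_tokenize(text)   # tokenization
tags = nltk.pos_tag(tokens)         # part-of-speech tagging
print(tags)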
CoreNLP
CoreNLP can be used to create linguistic annotations for text, such as token and sentence boundaries, parts of speech, named entities, numeric and temporal values, dependency and constituency parses, sentiment, quotation attributions, and relations between words. CoreNLP supports a variety of human languages, including Arabic, Chinese, English, French, German, and Spanish. It is written in Java but has support for Python as well.
Pattern
Pattern is a Python-based NLP library that provides features such as part-of-speech tagging, sentiment analysis, and vector space modeling. It offers support for the Twitter and Facebook APIs, a DOM parser, and a web crawler. Pattern is often used to convert HTML data to plain text and to resolve spelling mistakes in textual data.
Polyglot
The Polyglot library provides an impressive breadth of analysis and covers a wide range of languages. Polyglot's spaCy-like efficiency and ease of use make it an excellent choice for projects that need a language spaCy does not support. The polyglot package provides a command-line interface as well as library access through pipeline methods.
TextBlob
TextBlob is a Python library that is often used for natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and classification. This library is built on top of NLTK. Its user-friendly interface provides access to basic NLP tasks such as sentiment analysis, word extraction, parsing, and many more.
Flair
Flair is a deep learning library built on top of PyTorch for NLP tasks. It supports a growing number of languages, and you can apply the latest NLP models to your text for tasks such as named entity recognition, part-of-speech tagging, classification, and word sense disambiguation. Flair natively provides pre-trained models for NLP tasks such as text classification, part-of-speech tagging, and named entity recognition.
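For comparison, TextBlob exposes these basic tasks with very little code. A small sketch, assuming textblob and its corpora have been installed (python -m textblob.download_corpora):

from textblob import TextBlob

blob = TextBlob("Flair and TextBlob make basic NLP tasks pleasantly simple.")
print(blob.sentiment)      # polarity and subjectivity scores
print(blob.noun_phrases)   # noun phrases found in the sentence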
Paraphrasing refers to rewriting something in different words and with different expressions, without changing the underlying concept or meaning. It is a method in which we use alternative words and different sentence structures. Paraphrasing is a restatement of any content or text, and it is done using a sentence re-phraser (paraphraser). What is a good paraphrase? Almost all conditioned text generation models are validated on two factors: (1) whether the generated text conveys the same meaning as the original context (adequacy), and (2) whether the text is fluent, grammatically correct English (fluency). For instance, Neural Machine Translation outputs are tested for adequacy and fluency. But a good paraphrase should be adequate and fluent while being as different as possible on the surface lexical form. With respect to this definition, the three key metrics that measure the quality of paraphrases are:
- Adequacy (Is the meaning preserved adequately?)
- Fluency (Is the paraphrase fluent English?)
- Diversity (Lexical / Phrasal / Syntactical) (How much has the paraphrase changed the original sentence?)
- Data Augmentation: Paraphrasing helps in augmenting/creating training data for Natural Language Understanding (NLU) models, building robust models for conversational engines by creating equivalent paraphrases for a particular phrase or sentence and thereby creating a text corpus as training data.
- Summarization: Paraphrasing helps to create summaries of a large text corpus for understanding the crux of the text corpus.
- Sentence Rephrasing: Paraphrasing helps in generating sentences with a similar context for a particular phrase/sentence. These rephrased sentences can be used to create plagiarism-free content for articles, blogs, etc. A typical process flow for creating training data by data augmentation using a paraphraser is pictured below:
(Figure: typical process flow for paraphrase-based data augmentation)
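For readers who want to try this without the kit, one way to generate paraphrases is a T5-style sequence-to-sequence model through the Hugging Face transformers pipeline. The checkpoint named below is just one publicly available example and is an assumption, not part of this kit.

from transformers import pipeline

# Illustrative sketch: "Vamsi/T5_Paraphrase_Paws" is one public paraphrase checkpoint;
# any T5-style paraphrase model could be substituted.
paraphraser = pipeline("text2text-generation", model="Vamsi/T5_Paraphrase_Paws")

sentence = "Paraphrasing restates a sentence with different words but the same meaning."
candidates = paraphraser(
    "paraphrase: " + sentence,   # this checkpoint expects the "paraphrase:" prefix
    num_beams=5,
    num_return_sequences=3,      # several candidates for data augmentation
    max_length=64,
)
for c in candidates:
    print(c["generated_text"])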
Troubleshooting
For Windows users: when you attempt to run the kit_installer batch file, you might see a prompt from Microsoft Defender.
Artificial intelligence (AI) makes it possible for machines to learn from experience, adjust to new inputs and perform human-like tasks. Most AI examples that you hear about today – from chess-playing computers to self-driving cars – rely heavily on deep learning and natural language processing.
Trending Discussions on Natural Language Processing
How can I convert this language to actual numbers and text?
Numpy: Get indices of boundaries in an array where starts of boundaries always start with a particular number; non-boundaries by a particular number
How to replace text between multiple tags based on character length
Remove duplicates from a tuple
ValueError: You must include at least one label and at least one sequence
RuntimeError: CUDA out of memory | Elastic Search
News article extract using requests,bs4 and newspaper packages. why doesn't links=soup.select(".r a") find anything?. This code was working earlier
Changing Stanza model directory for pyinstaller executable
Type-Token Ratio in Google Sheets: How to manipulate long strings of text (millions of characters)
a bug for tf.keras.layers.TextVectorization when built from saved configs and weights
QUESTION
How can I convert this language to actual numbers and text?
Asked 2022-Mar-06 at 13:11
I am working on a natural language processing project with deep learning, and I downloaded a word embedding file. The file is in .bin format. I can open that file with
file = open("cbow.bin", "rb")
But when I type
file = open("cbow.bin", "rb")
file.read(100)
I get
b'4347907 300\n</s> H\xe1\xae:0\x16\xc1:\xbfX\xa7\xbaR8\x8f\xba\xa0\xd3\xee9K\xfe\x83::m\xa49\xbc\xbb\x938\xa4p\x9d\xbat\xdaA:UU\xbe\xba\x93_\xda9\x82N\x83\xb9\xaeG\xa7\xb9\xde\xdd\x90\xbaww$\xba\xfdba:\x14.\x84:R\xb8\x81:0\x96\x0b:\x96\xfc\x06'
What is this language and how can I convert it into actual numbers and text using Python?
ANSWER
Answered 2022-Mar-06 at 13:11
This weird language you are referring to is a Python bytestring.
As @jolitti implied, you won't be able to convert this particular bytestring to readable text.
If the bytestring contained any characters you recognize, they would have been displayed like this:
b'Guido van Rossum'
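If the file follows the standard word2vec binary format (the 4347907 300 header looks like a vocabulary size followed by a vector dimension), gensim can usually parse it into actual words and numbers. A hedged sketch, assuming gensim 4.x is installed:

from gensim.models import KeyedVectors

# Load the binary word2vec file; limit=50000 keeps memory use modest for a quick look.
vectors = KeyedVectors.load_word2vec_format("cbow.bin", binary=True, limit=50000)
print(vectors.vector_size)       # e.g. 300
word = vectors.index_to_key[0]   # first word in the vocabulary
print(word, vectors[word][:5])   # the word and the first few components of its vector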
QUESTION
Numpy: Get indices of boundaries in an array where starts of boundaries always start with a particular number; non-boundaries by a particular number
Asked 2022-Mar-01 at 00:01
Problem:
The most computationally efficient solution to getting the indices of boundaries in an array where starts of boundaries always start with a particular number and non-boundaries are indicated by a different particular number.
Differences between this question and other boundary-based numpy questions on SO:
here are some other boundary based numpy questions
Numpy 1D array - find indices of boundaries of subsequences of the same number
Getting the boundary of numpy array shape with a hole
Extracting boundary of a numpy array
The difference between the question I am asking and other stackoverflow posts in my attempt to search for a solution is that the other boundaries are indicated by a jump in value, or a 'hole' of values.
What seems to be unique to my case is the starts of boundaries always start with a particular number.
Motivation:
This problem is inspired by IOB tagging in natural language processing. In IOB tagging, B [beginning] tags the first token of an entity, I [inside] tags every other token inside an entity, and O [outside] tags all non-entity tokens.
Example:
import numpy as np

a = np.array(
    [
        0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 1, 0, 0, 1, 2, 1, 1, 0, 0, 1, 1, 1
    ]
)
1 is the start of each boundary. If a boundary has a length greater than one, then 2 makes up the rest of the boundary. 0 marks non-boundary positions.
The entities of these boundaries are 1, 2, 2, 2; 1; 1, 2; 1; 1; 1; 1; 1.
So the desired solution, the indices of the boundary values for a, is:
desired = [[3, 6], [10, 10], [13, 14], [15, 15], [16,16], [19,19], [20,20], [21,21]]
Current Solution:
If flattened, the numbers in the desired solution are in ascending order. So the raw indices numbers can be calculated, sorted, and reshaped later.
I can get the start indices using
starts = np.where(a==1)[0]
starts
array([ 3, 10, 13, 15, 16, 19, 20, 21])
So what's left is 6, 10, 14, 15, 16, 19, 20, 21.
I can get all except one using 3 different conditionals, where I compare a shifted array to the original by decreases in values and the values of the non-shifted array.
first = np.where(a[:-1] - 2 == a[1:])[0]
first
array([6])
second = np.where((a[:-1] - 1 == a[1:]) &
                  ((a[1:]==1) | (a[1:]==0)))[0]
second
array([10, 14, 16])
third = np.where(
    (a[:-1] == a[1:]) &
    (a[1:]==1)
)[0]
third
array([15, 19, 20])
The last number I need is 21, but since I needed to shorten the length of the array by 1 to do the shifted comparisons, I'm not sure how to get that particular value using logic, so I just used a simple if statement for that.
Using the rest of the retrieved values for the indices, I can concatenate all the values and reshape them.
if (a[-1] == 1) | (a[-1] == 2):
    pen = np.concatenate((
        starts, first, second, third, np.array([a.shape[0]-1])
    ))
else:
    pen = np.concatenate((
        starts, first, second, third,
    ))
np.sort(pen).reshape(-1,2)
array([[ 3,  6],
       [10, 10],
       [13, 14],
       [15, 15],
       [16, 16],
       [19, 19],
       [20, 20],
       [21, 21]])
Is this the most computationally efficient solution? I realize the four where statements can be combined with or operators, but I wanted to keep each one separate so the reader can see each result in this post. Still, I am wondering if there is a more computationally efficient solution, since I have not mastered all of numpy's functions and am unsure of the computational efficiency of each.
ANSWER
Answered 2022-Mar-01 at 00:01
A standard trick for this type of problem is to pad the input appropriately. In this case, it is helpful to append a 0 to the end of the array:
In [55]: a1 = np.concatenate((a, [0]))

In [56]: a1
Out[56]:
array([0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 1, 0, 0, 1, 2, 1, 1, 0, 0, 1, 1, 1,
       0])
Then your starts calculation still works:
In [57]: starts = np.where(a1 == 1)[0]

In [58]: starts
Out[58]: array([ 3, 10, 13, 15, 16, 19, 20, 21])
The condition for the end is that the value is a 1 or a 2 followed by a value that is not 2. You've already figured out that to handle the "followed by" condition, you can use a shifted version of the array. To implement the and and or conditions, use the bitwise binary operators & and |, respectively. In code, it looks like:
In [61]: ends = np.where((a1[:-1] != 0) & (a1[1:] != 2))[0]

In [62]: ends
Out[62]: array([ 6, 10, 14, 15, 16, 19, 20, 21])
Finally, put starts and ends into a single array:
In [63]: np.column_stack((starts, ends))
Out[63]:
array([[ 3,  6],
       [10, 10],
       [13, 14],
       [15, 15],
       [16, 16],
       [19, 19],
       [20, 20],
       [21, 21]])
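Putting the answer's two vectorized steps together, the whole computation can be wrapped in one small helper (the function name is ours, not part of the answer):

import numpy as np

def boundary_spans(a):
    # Pad with a trailing 0 so a boundary that runs to the end of the array still closes.
    a1 = np.concatenate((a, [0]))
    starts = np.where(a1 == 1)[0]
    # An end is a 1 or 2 that is not followed by a 2.
    ends = np.where((a1[:-1] != 0) & (a1[1:] != 2))[0]
    return np.column_stack((starts, ends))

a = np.array([0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 1, 0, 0, 1, 2, 1, 1, 0, 0, 1, 1, 1])
print(boundary_spans(a))   # pairs [3, 6], [10, 10], [13, 14], [15, 15], [16, 16], [19, 19], [20, 20], [21, 21]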
QUESTION
How to replace text between multiple tags based on character length
Asked 2022-Feb-11 at 14:53
I am dealing with dirty text data (and not with valid HTML). I am doing natural language processing, and short code snippets shouldn't be removed because they can contain valuable information, while long code snippets don't.
That's why I would like to remove the text between code tags only if the content that will be removed has character length > n.
Let's say the number of allowed characters between two code tags is n <= 5. Then everything between those tags that is longer than 5 characters will be removed.
My approach so far deletes all of the code characters:
import re

text = "This is a string <code>1234</code> another string <code>123</code> another string <code>123456789</code> another string."
text = re.sub("<code>.*?</code>", '', text)
print(text)

Output: This is a string another string another string another string.
The desired output:
"This is a string <code>1234</code> another string <code>123</code> another string another string."
Is there a way to count the text length for all of the appearing <code>...</code> tags before they are actually removed?
ANSWER
Answered 2022-Feb-11 at 14:53
In Python, BeautifulSoup is often used to manipulate HTML/XML content. If you use this library, you can use something like:
from bs4 import BeautifulSoup

text = "This is a string <code>1234</code> another string <code>123</code> another string <code>123456789</code> another string."
soup = BeautifulSoup(text, "html.parser")
for code in soup.find_all("code"):
    if len(code.encode_contents()) > 5:  # Check the inner HTML length
        code.extract()  # Remove the node found

print(str(soup))
# => This is a string <code>1234</code> another string <code>123</code> another string another string.
Note that here, the length of the inner HTML part is taken into account, not the inner text.
With regex, you can use a negated character class pattern, [^<], to match any char other than <, and apply a limiting quantifier to it. If everything longer than 5 chars should be removed, use the {6,} quantifier:
import re

text = "This is a string <code>1234</code> another string <code>123</code> another string <code>123456789</code> another string."
text = re.sub(r'<code>[^<]{6,}</code>', '', text)
print(text)
# => This is a string <code>1234</code> another string <code>123</code> another string another string.
See this Python demo.
QUESTION
Remove duplicates from a tuple
Asked 2022-Feb-09 at 23:43
I tried to extract keywords from a text. Using the "en_core_sci_lg" model, I got a tuple of phrases/words with some duplicates, which I tried to remove. I tried deduplication functions for lists and tuples, but they failed. Can anyone help? I really appreciate it.
text = """spaCy is an open-source software library for advanced natural language processing,
written in the programming languages Python and Cython. The MIT library is published under the MIT license and its main developers are Matthew Honnibal and Ines Honnibal, the founders of the software company Explosion."""
One set of code I have tried:
import spacy
nlp = spacy.load("en_core_sci_lg")

doc = nlp(text)
my_tuple = list(set(doc.ents))
print('original tuple', doc.ents, len(doc.ents))
print('after set function', my_tuple, len(my_tuple))
the output:
original tuple: (spaCy, open-source software library, programming languages, Python, Cython, MIT, library, published, MIT, license, developers, Matthew Honnibal, Ines, Honnibal, founders, software company Explosion) 16

after set function: [Honnibal, MIT, Ines, software company Explosion, founders, programming languages, library, Matthew Honnibal, license, Cython, Python, developers, MIT, published, open-source software library, spaCy] 16
The desired output is (there should be only one MIT, and the name Ines Honnibal should be together):
[Ines Honnibal, MIT, software company Explosion, founders, programming languages, library, Matthew Honnibal, license, Cython, Python, developers, published, open-source software library, spaCy]
ANSWER
Answered 2022-Feb-09 at 22:08
doc.ents is not a list of strings. It is a list of Span objects. When you print one, it prints its contents, but they are indeed individual objects, which is why set doesn't see that they are duplicates. The clue to that is there are no quote marks in your printed output. If those were strings, you'd see quotation marks.
You should try using doc.words instead of doc.ents. If that doesn't work for you, for some reason, you can do:
my_tuple = list(set(e.text for e in doc.ents))
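If you also want to keep the entities in their original order (set does not preserve it), a small variation on the same idea, reusing the doc object from the snippet above:

# dict.fromkeys keeps insertion order (Python 3.7+), so duplicates are dropped
# while the original entity order is preserved.
my_tuple = list(dict.fromkeys(e.text for e in doc.ents))
print(my_tuple)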
QUESTION
ValueError: You must include at least one label and at least one sequence
Asked 2021-Dec-14 at 09:15
I'm using this Notebook, where the section Apply DocumentClassifier is altered as below.
Jupyter Labs, kernel: conda_mxnet_latest_p37.
The error appears to be a standard ML response. However, I pass/create the same parameters and variable names as the original code, so it must be something to do with their values in my code.
My Code:
with open('filt_gri.txt', 'r') as filehandle:
    tags = [current_place.rstrip() for current_place in filehandle.readlines()]

doc_classifier = TransformersDocumentClassifier(model_name_or_path="cross-encoder/nli-distilroberta-base",
                                                task="zero-shot-classification",
                                                labels=tags,
                                                batch_size=16)

# convert to Document using a fieldmap for custom content fields the classification should run on
docs_to_classify = [Document.from_dict(d) for d in docs_sliding_window]

# classify using gpu, batch_size makes sure we do not run out of memory
classified_docs = doc_classifier.predict(docs_to_classify)

# let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
print(classified_docs[0].to_dict())

all_docs = convert_files_to_dicts(dir_path=doc_dir)

preprocessor_sliding_window = PreProcessor(split_overlap=3,
                                           split_length=10,
                                           split_respect_sentence_boundary=False,
                                           split_by='passage')
Output:
24INFO - haystack.modeling.utils - Using devices: CUDA
25INFO - haystack.modeling.utils - Number of GPUs: 1
26---------------------------------------------------------------------------
27ValueError Traceback (most recent call last)
28<ipython-input-11-77eb98038283> in <module>
29 14
30 15 # classify using gpu, batch_size makes sure we do not run out of memory
31---> 16 classified_docs = doc_classifier.predict(docs_to_classify)
32 17
33 18 # let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
34
35~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/haystack/nodes/document_classifier/transformers.py in predict(self, documents)
36 137 batches = self.get_batches(texts, batch_size=self.batch_size)
37 138 if self.task == 'zero-shot-classification':
38--> 139 batched_predictions = [self.model(batch, candidate_labels=self.labels, truncation=True) for batch in batches]
39 140 elif self.task == 'text-classification':
40 141 batched_predictions = [self.model(batch, return_all_scores=self.return_all_scores, truncation=True) for batch in batches]
41
42~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/haystack/nodes/document_classifier/transformers.py in <listcomp>(.0)
43 137 batches = self.get_batches(texts, batch_size=self.batch_size)
44 138 if self.task == 'zero-shot-classification':
45--> 139 batched_predictions = [self.model(batch, candidate_labels=self.labels, truncation=True) for batch in batches]
46 140 elif self.task == 'text-classification':
47 141 batched_predictions = [self.model(batch, return_all_scores=self.return_all_scores, truncation=True) for batch in batches]
48
49~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/pipelines/zero_shot_classification.py in __call__(self, sequences, candidate_labels, hypothesis_template, multi_label, **kwargs)
50 151 sequences = [sequences]
51 152
52--> 153 outputs = super().__call__(sequences, candidate_labels, hypothesis_template)
53 154 num_sequences = len(sequences)
54 155 candidate_labels = self._args_parser._parse_labels(candidate_labels)
55
56~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/pipelines/base.py in __call__(self, *args, **kwargs)
57 758
58 759 def __call__(self, *args, **kwargs):
59--> 760 inputs = self._parse_and_tokenize(*args, **kwargs)
60 761 return self._forward(inputs)
61 762
62
63~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/pipelines/zero_shot_classification.py in _parse_and_tokenize(self, sequences, candidate_labels, hypothesis_template, padding, add_special_tokens, truncation, **kwargs)
64 92 Parse arguments and tokenize only_first so that hypothesis (label) is not truncated
65 93 """
66---> 94 sequence_pairs = self._args_parser(sequences, candidate_labels, hypothesis_template)
67 95 inputs = self.tokenizer(
68 96 sequence_pairs,
69
70~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/pipelines/zero_shot_classification.py in __call__(self, sequences, labels, hypothesis_template)
71 25 def __call__(self, sequences, labels, hypothesis_template):
72 26 if len(labels) == 0 or len(sequences) == 0:
73---> 27 raise ValueError("You must include at least one label and at least one sequence.")
74 28 if hypothesis_template.format(labels[0]) == hypothesis_template:
75 29 raise ValueError(
76
77ValueError: You must include at least one label and at least one sequence.
78
Original Code:
doc_classifier = TransformersDocumentClassifier(model_name_or_path="cross-encoder/nli-distilroberta-base",
                                                task="zero-shot-classification",
                                                labels=["music", "natural language processing", "history"],
                                                batch_size=16)

# ----------

# convert to Document using a fieldmap for custom content fields the classification should run on
docs_to_classify = [Document.from_dict(d) for d in docs_sliding_window]

# ----------

# classify using gpu, batch_size makes sure we do not run out of memory
classified_docs = doc_classifier.predict(docs_to_classify)

# ----------

# let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
print(classified_docs[0].to_dict())
Please let me know if there is anything else I should add to the post or clarify.
ANSWER
Answered 2021-Dec-08 at 21:05
Reading the official docs and noting that the error is generated when calling .predict(docs_to_classify), I would recommend starting with basic tests, such as passing the parameter labels = ["negative", "positive"] directly, to check whether the problem is caused by the string values read from the external file. Optionally, you should also check the part of the docs that shows the use of pipelines:
pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=doc_classifier, name='DocClassifier', inputs=['Retriever'])
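One quick way to act on that advice is to inspect what actually came out of filt_gri.txt before building the classifier; empty or whitespace-only lines would leave you with no usable labels. A sketch reusing the question's file name:

# Sanity-check the labels read from the file before passing them to the classifier.
with open('filt_gri.txt', 'r') as filehandle:
    tags = [line.strip() for line in filehandle if line.strip()]   # drop blank lines

print(len(tags), tags[:5])
assert len(tags) > 0, "No labels were read from filt_gri.txt"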
QUESTION
RuntimeError: CUDA out of memory | Elastic Search
Asked 2021-Dec-09 at 11:53
I'm fairly new to Machine Learning. I've successfully solved errors to do with parameters and model setup.
I'm using this Notebook, where section Apply DocumentClassifier is altered as below.
Jupyter Labs, kernel: conda_mxnet_latest_p37.
The error seems to be more about my laptop's hardware than about my code being broken.
Update: I changed batch_size=4; it ran for ages, only to crash.
What should be my standard approach to solving this error?
My Code:
with open('filt_gri.txt', 'r') as filehandle:
    tags = [current_place.rstrip() for current_place in filehandle.readlines()]

doc_classifier = TransformersDocumentClassifier(model_name_or_path="cross-encoder/nli-distilroberta-base",
                                                task="zero-shot-classification",
                                                labels=tags,
                                                batch_size=4)

# convert to Document using a fieldmap for custom content fields the classification should run on
docs_to_classify = [Document.from_dict(d) for d in docs_sliding_window]

# classify using gpu, batch_size makes sure we do not run out of memory
classified_docs = doc_classifier.predict(docs_to_classify)

# let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
print(classified_docs[0].to_dict())

all_docs = convert_files_to_dicts(dir_path=doc_dir)

preprocessor_sliding_window = PreProcessor(split_overlap=3,
                                           split_length=10,
                                           split_respect_sentence_boundary=False,
                                           split_by='passage')
Error:
24INFO - haystack.modeling.utils - Using devices: CUDA
25INFO - haystack.modeling.utils - Using devices: CUDA
26INFO - haystack.modeling.utils - Number of GPUs: 1
27INFO - haystack.modeling.utils - Number of GPUs: 1
28---------------------------------------------------------------------------
29RuntimeError Traceback (most recent call last)
30<ipython-input-25-27dfca549a7d> in <module>
31 14
32 15 # classify using gpu, batch_size makes sure we do not run out of memory
33---> 16 classified_docs = doc_classifier.predict(docs_to_classify)
34 17
35 18 # let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
36
37~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/haystack/nodes/document_classifier/transformers.py in predict(self, documents)
38 137 batches = self.get_batches(texts, batch_size=self.batch_size)
39 138 if self.task == 'zero-shot-classification':
40--> 139 batched_predictions = [self.model(batch, candidate_labels=self.labels, truncation=True) for batch in batches]
41 140 elif self.task == 'text-classification':
42 141 batched_predictions = [self.model(batch, return_all_scores=self.return_all_scores, truncation=True) for batch in batches]
43
44~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/haystack/nodes/document_classifier/transformers.py in <listcomp>(.0)
45 137 batches = self.get_batches(texts, batch_size=self.batch_size)
46 138 if self.task == 'zero-shot-classification':
47--> 139 batched_predictions = [self.model(batch, candidate_labels=self.labels, truncation=True) for batch in batches]
48 140 elif self.task == 'text-classification':
49 141 batched_predictions = [self.model(batch, return_all_scores=self.return_all_scores, truncation=True) for batch in batches]
50
51~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/pipelines/zero_shot_classification.py in __call__(self, sequences, candidate_labels, hypothesis_template, multi_label, **kwargs)
52 151 sequences = [sequences]
53 152
54--> 153 outputs = super().__call__(sequences, candidate_labels, hypothesis_template)
55 154 num_sequences = len(sequences)
56 155 candidate_labels = self._args_parser._parse_labels(candidate_labels)
57
58~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/pipelines/base.py in __call__(self, *args, **kwargs)
59 759 def __call__(self, *args, **kwargs):
60 760 inputs = self._parse_and_tokenize(*args, **kwargs)
61--> 761 return self._forward(inputs)
62 762
63 763 def _forward(self, inputs, return_tensors=False):
64
65~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/pipelines/base.py in _forward(self, inputs, return_tensors)
66 780 with torch.no_grad():
67 781 inputs = self.ensure_tensor_on_device(**inputs)
68--> 782 predictions = self.model(**inputs)[0].cpu()
69 783
70 784 if return_tensors:
71
72~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
73 1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
74 1101 or _global_forward_hooks or _global_forward_pre_hooks):
75-> 1102 return forward_call(*input, **kwargs)
76 1103 # Do not call functions when jit is used
77 1104 full_backward_hooks, non_full_backward_hooks = [], []
78
79~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
80 1162 output_attentions=output_attentions,
81 1163 output_hidden_states=output_hidden_states,
82-> 1164 return_dict=return_dict,
83 1165 )
84 1166 sequence_output = outputs[0]
85
86~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
87 1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
88 1101 or _global_forward_hooks or _global_forward_pre_hooks):
89-> 1102 return forward_call(*input, **kwargs)
90 1103 # Do not call functions when jit is used
91 1104 full_backward_hooks, non_full_backward_hooks = [], []
92
93~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
94 823 output_attentions=output_attentions,
95 824 output_hidden_states=output_hidden_states,
96--> 825 return_dict=return_dict,
97 826 )
98 827 sequence_output = encoder_outputs[0]
99
100~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
101 1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
102 1101 or _global_forward_hooks or _global_forward_pre_hooks):
103-> 1102 return forward_call(*input, **kwargs)
104 1103 # Do not call functions when jit is used
105 1104 full_backward_hooks, non_full_backward_hooks = [], []
106
107~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
108 513 encoder_attention_mask,
109 514 past_key_value,
110--> 515 output_attentions,
111 516 )
112 517
113
114~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
115 1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
116 1101 or _global_forward_hooks or _global_forward_pre_hooks):
117-> 1102 return forward_call(*input, **kwargs)
118 1103 # Do not call functions when jit is used
119 1104 full_backward_hooks, non_full_backward_hooks = [], []
120
121~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
122 398 head_mask,
123 399 output_attentions=output_attentions,
124--> 400 past_key_value=self_attn_past_key_value,
125 401 )
126 402 attention_output = self_attention_outputs[0]
127
128~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
129 1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
130 1101 or _global_forward_hooks or _global_forward_pre_hooks):
131-> 1102 return forward_call(*input, **kwargs)
132 1103 # Do not call functions when jit is used
133 1104 full_backward_hooks, non_full_backward_hooks = [], []
134
135~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
136 328 encoder_attention_mask,
137 329 past_key_value,
138--> 330 output_attentions,
139 331 )
140 332 attention_output = self.output(self_outputs[0], hidden_states)
141
142~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
143 1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
144 1101 or _global_forward_hooks or _global_forward_pre_hooks):
145-> 1102 return forward_call(*input, **kwargs)
146 1103 # Do not call functions when jit is used
147 1104 full_backward_hooks, non_full_backward_hooks = [], []
148
149~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
150 241 attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key
151 242
152--> 243 attention_scores = attention_scores / math.sqrt(self.attention_head_size)
153 244 if attention_mask is not None:
154 245 # Apply the attention mask is (precomputed for all layers in RobertaModel forward() function)
155
156RuntimeError: CUDA out of memory. Tried to allocate 3.60 GiB (GPU 0; 14.76 GiB total capacity; 7.33 GiB already allocated; 1.37 GiB free; 12.29 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Original Code:
with open('filt_gri.txt', 'r') as filehandle:
    tags = [current_place.rstrip() for current_place in filehandle.readlines()]

doc_classifier = TransformersDocumentClassifier(model_name_or_path="cross-encoder/nli-distilroberta-base",
                                                task="zero-shot-classification",
                                                labels=tags,
                                                batch_size=4)

# convert to Document using a fieldmap for custom content fields the classification should run on
docs_to_classify = [Document.from_dict(d) for d in docs_sliding_window]

# classify using gpu, batch_size makes sure we do not run out of memory
classified_docs = doc_classifier.predict(docs_to_classify)

# let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
print(classified_docs[0].to_dict())

all_docs = convert_files_to_dicts(dir_path=doc_dir)

preprocessor_sliding_window = PreProcessor(split_overlap=3,
                                           split_length=10,
                                           split_respect_sentence_boundary=False,
                                           split_by='passage')

INFO - haystack.modeling.utils - Using devices: CUDA
INFO - haystack.modeling.utils - Using devices: CUDA
INFO - haystack.modeling.utils - Number of GPUs: 1
INFO - haystack.modeling.utils - Number of GPUs: 1
doc_classifier = TransformersDocumentClassifier(model_name_or_path="cross-encoder/nli-distilroberta-base",
                                                task="zero-shot-classification",
                                                labels=["music", "natural language processing", "history"],
                                                batch_size=16)

# ----------

# convert to Document using a fieldmap for custom content fields the classification should run on
docs_to_classify = [Document.from_dict(d) for d in docs_sliding_window]

# ----------

# classify using gpu, batch_size makes sure we do not run out of memory
classified_docs = doc_classifier.predict(docs_to_classify)

# ----------

# let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
print(classified_docs[0].to_dict())
Please let me know if there is anything else I should add to the post or clarify.
ANSWER
Answered 2021-Dec-09 at 11:53
Reducing the batch_size helped me:
with open('filt_gri.txt', 'r') as filehandle:
    tags = [current_place.rstrip() for current_place in filehandle.readlines()]

doc_classifier = TransformersDocumentClassifier(model_name_or_path="cross-encoder/nli-distilroberta-base",
                                                task="zero-shot-classification",
                                                labels=tags,
                                                batch_size=4)

# convert to Document using a fieldmap for custom content fields the classification should run on
docs_to_classify = [Document.from_dict(d) for d in docs_sliding_window]

# classify using gpu, batch_size makes sure we do not run out of memory
classified_docs = doc_classifier.predict(docs_to_classify)

# let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
print(classified_docs[0].to_dict())

all_docs = convert_files_to_dicts(dir_path=doc_dir)

preprocessor_sliding_window = PreProcessor(split_overlap=3,
                                           split_length=10,
                                           split_respect_sentence_boundary=False,
                                           split_by='passage')
batch_size=2
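The error message in the traceback also points at PyTorch's caching allocator (PYTORCH_CUDA_ALLOC_CONF / max_split_size_mb). The sketch below is an editor's illustration, not part of the original answer: it assumes a Haystack 1.x import path, reuses the tags and docs_to_classify variables defined above, and the max_split_size_mb value of 128 is an arbitrary example.
import os

# As the CUDA OOM message suggests, the caching allocator can be tuned to reduce fragmentation.
# This must be set before torch initialises CUDA; 128 MB is an arbitrary example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

from haystack.nodes import TransformersDocumentClassifier

# A smaller batch_size means fewer (document, candidate label) pairs are scored per forward pass,
# which lowers peak GPU memory at the cost of a longer runtime.
doc_classifier = TransformersDocumentClassifier(model_name_or_path="cross-encoder/nli-distilroberta-base",
                                                task="zero-shot-classification",
                                                labels=tags,      # labels read from filt_gri.txt above
                                                batch_size=2)     # reduced further if 4 still overflows

classified_docs = doc_classifier.predict(docs_to_classify)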
QUESTION
News article extraction using requests, bs4 and newspaper packages: why doesn't links=soup.select(".r a") find anything? This code was working earlier.
Asked 2021-Nov-16 at 22:43
Objective: I am trying to download news articles based on keywords in order to perform sentiment analysis.
This code was working a few months ago, but now it returns nothing. I tried fixing the issue, but links=soup.select(".r a") still returns an empty list.
import pandas as pd
import requests
from bs4 import BeautifulSoup
import string
import nltk
from urllib.request import urlopen
import sys
import webbrowser
import newspaper
import time
from newspaper import Article

Company_name1 = []
Article_number1 = []
Article_Title1 = []
Article_Authors1 = []
Article_pub_date1 = []
Article_Text1 = []
Article_Summary1 = []
Article_Keywords1 = []
Final_dataframe = []

class Newspapr_pd:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0
        self.url = 'https://www.google.com/search?q={0}&safe=active&tbs=qdr:w,sdb:1&tbm=nws&source=lnt&dpr=1'.format(self.term)

    def NewsArticlerun_pd(self):
        response = requests.get(self.url)
        response.raise_for_status()
        # print(response.text)
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        links = soup.select(".r a")

        numOpen = min(5, len(links))
        Article_number = 0
        for i in range(numOpen):
            response_links = webbrower.open("https://www.google.com" + links[i].get("href"))

            # For a different language newspaper refer to the table above
            article = Article(response_links, language="en")  # en for English
            Article_number += 1

            print('*************************************************************************************')

            Article_number1.append(Article_number)
            Company_name1.append(self.term)

            # To download the article
            try:
                article.download()
                # To parse the article
                article.parse()
                # To perform natural language processing, i.e. nlp
                article.nlp()

                # To extract the title
                Article_Title1.append(article.title)

                # To extract the text
                Article_Text1.append(article.text)

                # To extract the author names
                Article_Authors1.append(article.authors)

                # To extract the article published date
                Article_pub_date1.append(article.publish_date)

                # To extract the summary
                Article_Summary1.append(article.summary)

                # To extract the keywords
                Article_Keywords1.append(article.keywords)

            except:
                print('Error in loading page')
                continue

        for art_num, com_name, title, text, auth, pub_dt, summaries, keywds in zip(Article_number1, Company_name1, Article_Title1, Article_Text1, Article_Authors1, Article_pub_date1, Article_Summary1, Article_Keywords1):
            Final_dataframe.append({'Article_link_num': art_num, 'Company_name': com_name, 'Article_Title': title, 'Article_Text': text, 'Article_Author': auth,
                                    'Article_Published_date': pub_dt, 'Article_Summary': summaries, 'Article_Keywords': keywds})

list_of_companies = ['Amazon', 'Jetairways', 'nirav modi']

for i in list_of_companies:
    comp = str('"' + i + '"')
    a = Newspapr_pd(comp)
    a.NewsArticlerun_pd()

Final_new_dataframe = pd.DataFrame(Final_dataframe)
Final_new_dataframe.tail()
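The failing selector can be checked directly against a live response. The snippet below is an editor's illustration, not part of the original post; it only shows how to confirm that the markup Google returns no longer carries the r class the code relies on (results depend on the HTML Google serves at the time).
import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.google.com/search?q="Amazon"&safe=active&tbs=qdr:w,sdb:1&tbm=nws&source=lnt&dpr=1')
soup = BeautifulSoup(resp.text, 'html.parser')

# If this prints 0, the ".r a" selector no longer matches anything in the page Google returned.
print(len(soup.select('.r a')))

# Listing the class names that are actually present shows what the result blocks use instead.
print(sorted({cls for tag in soup.find_all(class_=True) for cls in tag['class']}))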
ANSWER
Answered 2021-Nov-16 at 22:43
This is a very complex issue, because Google News continually changes its class names. Additionally, Google will add various prefixes to article URLs and throw in some hidden ad or social media tags.
The answer below only addresses scraping articles from Google News. More testing is needed to determine how it works with a large number of keywords and with Google News changing its page structure.
The Newspaper3k extraction is even more complex, because each article can have a different structure. I would recommend looking at my Newspaper3k Usage Overview document for details on how to design that part of your code.
P.S. I'm currently writing a new news scraper, because development of Newspaper3k is dead. I'm unsure of the release date of my code.
import requests
import re as regex
from bs4 import BeautifulSoup


def get_google_news_article(search_string):
    # Query the Google News tab for the search string (results from the past week).
    articles = []
    url = f'https://www.google.com/search?q={search_string}&safe=active&tbs=qdr:w,sdb:1&tbm=nws&source=lnt&dpr=1'
    response = requests.get(url)
    raw_html = BeautifulSoup(response.text, "lxml")
    main_tag = raw_html.find('div', {'id': 'main'})
    # Each news result sits in a div whose class name contains 'xpd'.
    for div_tag in main_tag.find_all('div', {'class': regex.compile('xpd')}):
        for a_tag in div_tag.find_all('a', href=True):
            if not a_tag.get('href').startswith('/search?'):
                # Skip links to well-known non-news domains.
                none_articles = bool(regex.search('amazon.com|facebook.com|twitter.com|youtube.com|wikipedia.org', a_tag['href']))
                if none_articles is False:
                    if a_tag.get('href').startswith('/url?q='):
                        # Strip the Google redirect prefix and tracking suffix to recover the article URL.
                        find_article = regex.search('(.*)(&sa=)', a_tag.get('href'))
                        article = find_article.group(1).replace('/url?q=', '')
                        if article.startswith('https://'):
                            articles.append(article)

    return articles


list_of_companies = ['amazon', 'jet airways', 'nirav modi']
for company_name in list_of_companies:
    print(company_name)
    search_results = get_google_news_article(company_name)
    for item in sorted(set(search_results)):
        print(item)
    print('\n')
This is the output from the code above:
amazon
https://9to5mac.com/2021/11/15/amazon-releases-native-prime-video-app-for-macos-with-purchase-support-and-more/
https://wtvbam.com/2021/11/15/india-police-to-question-amazon-executives-in-probe-over-marijuana-smuggling/
https://www.cnet.com/home/smart-home/all-the-new-amazon-features-for-your-smart-home-alexa-disney-echo/
https://www.cnet.com/tech/amazon-unveils-black-friday-deals-starting-on-nov-25/
https://www.crossroadstoday.com/i/amazons-best-black-friday-deals-for-2021-2/
https://www.reuters.com/technology/ibm-amazon-partner-extend-reach-data-tools-oil-companies-2021-11-15/
https://www.theverge.com/2021/11/15/22783275/amazon-basics-smart-switches-price-release-date-specs
https://www.tomsguide.com/news/amazon-echo-motion-detection
https://www.usatoday.com/story/money/shopping/2021/11/15/amazon-black-friday-2021-deals-online/8623710002/
https://www.winknews.com/2021/11/15/new-amazon-sortation-center-began-operations-monday-could-bring-faster-deliveries/

jet airways
https://economictimes.indiatimes.com/markets/expert-view/first-time-in-two-decades-new-airlines-are-starting-instead-of-closing-down-jyotiraditya-scindia/articleshow/87660724.cms
https://menafn.com/1103125331/Jet-Airways-to-resume-operations-in-Q1-2022
https://simpleflying.com/jet-airways-100-aircraft-5-years/
https://simpleflying.com/jet-airways-q3-loss/
https://www.business-standard.com/article/companies/defunct-carrier-jet-airways-posts-rs-306-cr-loss-in-september-quarter-121110901693_1.html
https://www.business-standard.com/article/markets/stocks-to-watch-ril-aurobindo-bhel-m-m-jet-airways-idfc-powergrid-121110900189_1.html
https://www.financialexpress.com/market/nykaa-hdfc-zee-media-jet-airways-power-grid-berger-paints-petronet-lng-stocks-in-focus/2366063/
https://www.moneycontrol.com/news/business/earnings/jet-airways-standalone-september-2021-net-sales-at-rs-41-02-crore-up-313-51-y-o-y-7702891.html
https://www.spokesman.com/stories/2021/nov/11/boeing-set-to-dent-airbus-india-dominance-with-737/
https://www.timesnownews.com/business-economy/industry/article/times-now-summit-2021-jet-airways-will-make-a-comeback-into-indian-skies-akasa-to-take-off-next-year-says-jyotiraditya-scindia/831090

nirav modi
https://m.republicworld.com/india-news/general-news/piyush-goyal-says-few-rotten-eggs-destroyed-credibility-of-countrys-ca-sector.html
https://www.bulletnews.net/akkad-bakkad-rafu-chakkar-review-the-story-of-robbing-people-by-making-fake-banks/
https://www.daijiworld.com/news/newsDisplay%3FnewsID%3D893048
https://www.devdiscourse.com/article/law-order/1805317-hc-seeks-centres-stand-on-bankers-challenge-to-dismissal-from-service
https://www.geo.tv/latest/381560-arif-naqvis-extradition-case-to-be-heard-after-nirav-modi-case-ruling
https://www.hindustantimes.com/india-news/cbiand-ed-appointments-that-triggered-controversies-101636954580012.html
https://www.law360.com/articles/1439470/suicide-test-ruling-delays-abraaj-founder-s-extradition-case
https://www.moneycontrol.com/news/trends/current-affairs-trends/nirav-modi-extradition-case-outcome-of-appeal-to-also-affect-pakistani-origin-global-financier-facing-16-charges-of-fraud-and-money-laundering-7717231.html
https://www.thehansindia.com/hans/opinion/news-analysis/uniform-law-needed-for-free-exit-of-rich-businessmen-714566
https://www.thenews.com.pk/print/908374-uk-judge-delays-arif-naqvi-s-extradition-to-us
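These URLs cover only the collection step. As a rough sketch of the Newspaper3k extraction step mentioned above (not part of the original answer), the links returned by get_google_news_article could be passed to newspaper.Article roughly as follows; the helper name extract_articles and the error handling are illustrative additions:

from newspaper import Article

def extract_articles(urls, language='en'):
    # Download and parse each article URL with Newspaper3k.
    # Pages that fail to download or parse are skipped, since every
    # news site structures its articles differently.
    results = []
    for url in urls:
        article = Article(url, language=language)
        try:
            article.download()
            article.parse()
            article.nlp()  # needed for .summary and .keywords (requires nltk punkt data)
        except Exception as error:
            print(f'Skipping {url}: {error}')
            continue
        results.append({
            'url': url,
            'title': article.title,
            'authors': article.authors,
            'published': article.publish_date,
            'summary': article.summary,
            'keywords': article.keywords,
        })
    return results

# Example usage with the scraper above:
# articles = extract_articles(get_google_news_article('amazon'))

Skipping failures rather than raising keeps one badly structured page from stopping the whole run.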
QUESTION
Changing Stanza model directory for pyinstaller executable
Asked 2021-Nov-01 at 06:03
I have an application that analyses text looking for keywords using natural language processing.
I created an executable and it works fine on my computer.
I sent it to a friend, but on his computer he gets an error:
Traceback (most recent call last):
  File "Main.py", line 15, in <module>
  File "Menu.py", line 349, in main_menu
  File "Menu.py", line 262, in analyse_text_menu
  File "Menu.py", line 178, in analyse_text_function
  File "AnalyseText\ProcessText.py", line 232, in process_text
  File "AnalyseText\ProcessText.py", line 166, in generate_keyword_complete_list
  File "AnalyseText\ProcessText.py", line 135, in lemmatize_text
  File "stanza\pipeline\core.py", line 88, in _init_
stanza.pipeline.core.ResourcesFileNotFoundError: Resources file not found at: C:\Users\jpovoas\stanza_resources\resources.json Try to download the model again.
[26408] Failed to execute script 'Main' due to unhandled exception!
It is looking for resources.json inside a folder on his computer, even though I added stanza as a hidden import with PyInstaller.
I'm using a model in another language, not the default English one. The model is located in a folder inside the User folder.
The thing is, I don't want the end user to have to download the model separately.
I managed to include the model folder with --add-data "C:\Users\Laila\stanza_resources\pt;Stanza" when creating the executable.
It still looks for the model's JSON file inside the stanza_resources folder that is supposed to be in the User folder of whoever is using the program.
How do I tell Stanza to look for the model inside the generated executable's folder instead?
Can I just add stanza.download("language") to my script? If so, how do I change Stanza's model download folder? I want it to be downloaded into a folder inside the same directory as the executable. How do I do that?
ANSWER
Answered 2021-Sep-14 at 04:27
You can try downloading that JSON file. Here is a snippet for that:
import urllib.request

json_url = "http://LINK_TO_YOUR_JSON/resources.json"

urllib.request.urlretrieve(json_url, "./resources.json")
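The snippet above only fetches resources.json; it does not change where Stanza looks for its models. A minimal sketch of the bundling approach asked about, assuming Stanza's stanza.download(..., model_dir=...) and stanza.Pipeline(..., dir=...) parameters and PyInstaller's sys.frozen / sys._MEIPASS attributes; the 'pt' language code and folder names come from the question, everything else is illustrative:

import os
import sys
import stanza

def stanza_model_dir():
    # When frozen by PyInstaller, bundled data sits next to the executable
    # (one-folder build) or is unpacked into sys._MEIPASS (one-file build).
    if getattr(sys, 'frozen', False):
        base = getattr(sys, '_MEIPASS', os.path.dirname(sys.executable))
    else:
        base = os.path.dirname(os.path.abspath(__file__))
    return os.path.join(base, 'stanza_resources')

MODEL_DIR = stanza_model_dir()

# Download the Portuguese model into the local folder once (e.g. at build time),
# then ship that folder with --add-data "stanza_resources;stanza_resources".
if not os.path.exists(os.path.join(MODEL_DIR, 'resources.json')):
    stanza.download('pt', model_dir=MODEL_DIR)

# Point the pipeline at the bundled folder instead of ~/stanza_resources.
nlp = stanza.Pipeline('pt', dir=MODEL_DIR)

With this layout the end user never needs to download anything, since the resources folder travels with the executable.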
QUESTION
Type-Token Ratio in Google Sheets: How to manipulate long strings of text (millions of characters)
Asked 2021-Oct-03 at 11:39
Here's the challenge. In a Google Sheets spreadsheet, I have a column containing lists of words separated by commas, one list per row, for up to a thousand rows. Each list holds the words taken from a text, in alphanumeric order, from a few hundred to a few thousand words. I need to count both the total number of words across all rows and the number of unique word forms. In NLP terms, I want the number of tokens and the number of types in my corpus, in order to calculate the type-token ratio, or lexical density.
Finding the number of unique word forms across the whole column in particular has proven to be a challenge. In an ARRAYFORMULA, I JOINed the strings, SPLIT the words, TRANSPOSEd them, removed duplicates with UNIQUE, then counted the remaining word forms. This worked on a sample corpus of a little over ten lists of words, but failed once I reached fifteen or so lists taken together, a far cry from the thousand lists I need to join to obtain the results I am looking for.
From what I can gather, the problem is that the resulting string exceeds 50,000 characters. I've found similar questions and proposed workarounds for specific cases, mostly through custom functions, but I could not replicate their results, and writing custom functions on my own is beyond my reach. Someone suggested using QUERY headers, but I could not figure out whether that would help in my case.
The formulas I came up with are the following:
To obtain the total number of words (tokens) through all the lists:
=COUNTA(ARRAYFORMULA(SPLIT(JOIN(",";1;B2:B);",")))
To obtain the number of unique word forms (types) through all the lists:
=COUNTA(ARRAYFORMULA(UNIQUE(TRANSPOSE(SPLIT(JOIN(",";1;B2:B);",")))))
A sample in a spreadsheet can be found here.
EDIT 1:
I've included the column of texts stripped of punctuation, from which the lists of words are generated, and the formula used to generate them.
EDIT 2:
Changed the title to better reflect the general intent.
ANSWER
Answered 2021-Oct-02 at 13:57
For total items, try:
=arrayformula(query(flatten(iferror(split(B2:B;",";1);));"select count(Col1) where Col1 !='' label count(Col1) '' ";0))
For total unique items:
=arrayformula(query(unique(flatten(iferror(split(B2:B;",";1);)));"select count(Col1) where Col1 !='' label count(Col1) '' ";0))
You might get problems if you have too many rows in the sheet. If so, set the range limit to something like B2:B1000
Add this to cell C1 to get a list of 'Comma separated items':
=arrayformula({"Comma separated items";if(B2:B<>"";len(regexreplace(B2:B;"[^\,]";))+1;)})
Explanation:
The arrayformula() allows the calculation to cascade down the sheet from one cell.
Within the arrayformula(), the starting point is split(B2:B;","), which creates a column for each of the comma-separated items.
The iferror(split(B2:B;",");"") leaves a blank where a cell doesn't contain a comma (like those from row 32). Instead of ;"") I usually just write ;), dropping the "", so the iferror returns nothing.
flatten() then takes all of the columns and flattens them into a single column.
query() is needed to count the resulting column with count(Col1), restricted to non-empty cells with where Col1 !=''; label count(Col1) '' removes the 'count' label that would otherwise be displayed.
For the list of unique values, unique() is placed after the flatten() and before the query().
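Outside Google Sheets, the same token and type counts are straightforward to compute. A minimal Python sketch, assuming the comma-separated word lists have been exported to a plain text file with one list per line (the file name word_lists.txt is illustrative):

from collections import Counter

def type_token_ratio(path):
    # Count tokens (all words) and types (unique word forms) across all lists.
    counts = Counter()
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            words = [w.strip() for w in line.split(',') if w.strip()]
            counts.update(words)
    tokens = sum(counts.values())
    types = len(counts)
    return tokens, types, (types / tokens if tokens else 0.0)

tokens, types, ttr = type_token_ratio('word_lists.txt')
print(f'{tokens} tokens, {types} types, type-token ratio = {ttr:.3f}')

This avoids the 50,000-character string limit entirely, since no joined string is ever built.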
QUESTION
A bug in tf.keras.layers.TextVectorization when built from saved configs and weights
Asked 2021-Sep-28 at 13:57
I have tried writing a Python program to save tf.keras.layers.TextVectorization to disk and load it back, following the answer to "How to save TextVectorization to disk in tensorflow?".
A TextVectorization layer built from the saved config outputs a vector of the wrong length when output_sequence_length is not None and output_mode='int'.
For example, with output_sequence_length=10 and output_mode='int', TextVectorization is expected to output a vector of length 10 for a given text; see vectorizer and new_v2 in the code below.
However, if output_mode='int' is set from the saved config, the layer does not output a vector of length 10 (it is actually 9, the real length of the sentence; it seems output_sequence_length is not applied). See the object new_v1 in the code below.
The interesting thing is that I compared from_disk['config']['output_mode'] and 'int', and they are equal.
import tensorflow as tf
from tensorflow.keras.models import load_model
import pickle
# In[]
max_len = 10  # Sequence length to pad the outputs to.
text_dataset = tf.data.Dataset.from_tensor_slices([
    "I like natural language processing",
    "You like computer vision",
    "I like computer games and computer science"])
# Fit a TextVectorization layer
VOCAB_SIZE = 10  # Maximum vocab size.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=None,
    standardize="lower_and_strip_punctuation",
    split="whitespace",
    output_mode='int',
    output_sequence_length=max_len
    )
vectorizer.adapt(text_dataset.batch(64))
# In[]
#print(vectorizer.get_vocabulary())
#print(vectorizer.get_config())
#print(vectorizer.get_weights())
# In[]


# Pickle the config and weights
pickle.dump({'config': vectorizer.get_config(),
             'weights': vectorizer.get_weights()}
            , open("./models/tv_layer.pkl", "wb"))


# Later you can unpickle and use
# `config` to create object and
# `weights` to load the trained weights.

from_disk = pickle.load(open("./models/tv_layer.pkl", "rb"))

new_v1 = tf.keras.layers.TextVectorization(
    max_tokens=None,
    standardize="lower_and_strip_punctuation",
    split="whitespace",
    output_mode=from_disk['config']['output_mode'],
    output_sequence_length=from_disk['config']['output_sequence_length'],
    )
# You have to call `adapt` with some dummy data (BUG in Keras)
new_v1.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
new_v1.set_weights(from_disk['weights'])
new_v2 = tf.keras.layers.TextVectorization(
    max_tokens=None,
    standardize="lower_and_strip_punctuation",
    split="whitespace",
    output_mode='int',
    output_sequence_length=from_disk['config']['output_sequence_length'],
    )

# You have to call `adapt` with some dummy data (BUG in Keras)
new_v2.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
new_v2.set_weights(from_disk['weights'])
print("*"*10)
# In[]
test_sentence = "Jack likes computer scinece, computer games, and foreign language"

print(vectorizer(test_sentence))
print(new_v1(test_sentence))
print(new_v2(test_sentence))
print(from_disk['config']['output_mode'] == 'int')
Here are the print() outputs:
**********
tf.Tensor([ 1 1 3 1 3 11 12 1 10 0], shape=(10,), dtype=int64)
tf.Tensor([ 1 1 3 1 3 11 12 1 10], shape=(9,), dtype=int64)
tf.Tensor([ 1 1 3 1 3 11 12 1 10 0], shape=(10,), dtype=int64)
True
Does anyone know why?
ANSWER
Answered 2021-Sep-28 at 13:57
The bug is fixed by the PR at https://github.com/keras-team/keras/pull/15422.
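For completeness, the layer can also be rebuilt from the whole pickled config in one call instead of passing arguments individually. This is only a sketch of that reconstruction path, using the standard Keras from_config() classmethod and the names from the question (from_disk, test_sentence); whether it sidesteps the padding issue depends on the installed Keras version, which is why the final check is included.

# Rebuild the layer from the complete saved config rather than individual kwargs.
new_v3 = tf.keras.layers.TextVectorization.from_config(from_disk['config'])

# As in the question, adapt on dummy data before restoring the trained weights.
new_v3.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
new_v3.set_weights(from_disk['weights'])

# Verify that the output is padded/truncated to output_sequence_length.
print(new_v3(test_sentence))
print(new_v3(test_sentence).shape[0] == from_disk['config']['output_sequence_length'])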
Community Discussions contain sources that include Stack Exchange Network