edgar | A crawler to get company filing data from XBRL filings | Data Visualization library
kandi X-RAY | edgar Summary
A crawler to get company filing data from XBRL filings. The fetcher parses the HTML pages, extracts data based on the XBRL tags it finds, and collects it into filing data arranged by filing date.
Top functions reviewed by kandi - BETA
- setData provides a function to set data
- validate a financial report
- mapReports extracts the report number from a page
- lookupDocType looks up the document type
- failingPageParser parses a failing page
- parseHyperLinkTag parses a hyperlink tag
- parseTableRow parses a table row
- normalize a number
- get missing documents
- parseTableHeading parses a table heading
Community Discussions
Trending Discussions on edgar
QUESTION
So, I'm a very amateur Python programmer, but I hope everything I explain makes sense.
I want to scrape a type of financial document called a "10-K". I'm only interested in a small part of the whole document. An example of the URL I'm trying to scrape is: https://www.sec.gov/Archives/edgar/data/320193/0000320193-20-000096.txt
Now, if I download this document as a .txt, it "only" weighs 12 MB, so in my ignorance it doesn't make much sense that it takes 1-2 minutes to .read() (even though I have a decent PC).
The original code I was using:
...ANSWER
Answered 2021-Jun-13 at 18:07 The time it takes to read a document over the internet is really not related to the speed of your computer, at least in most cases. The most important determinant is the speed of your internet connection. Another important determinant is the speed with which the remote server responds to your request, which will depend in part on how many other requests the remote server is currently trying to handle.
It's also possible that the slow-down is not due to either of the above causes, but rather to measures taken by the remote server to limit scraping or to avoid congestion. It's very common for servers to deliberately reduce responsiveness to clients which make frequent requests, to deny the requests entirely, or to throttle data transmission for everyone, which is another way of controlling server load. In that case, there's not much you're going to be able to do to speed up reading the requests.
From my machine, it takes a bit under 30 seconds to download the 12MB document. Since I'm in Perú it's possible that the speed of the internet connection is a factor, but I suspect that it's not the only issue. However, the data transmission does start reasonably quickly.
If the problem were related to the speed of data transfer between your machine and the server, you could speed things up by using a streaming parser (a phrase you can search for). A streaming parser reads its input in small chunks and assembles them on the fly into tokens, which is basically what you are trying to do. But the streaming parser will deal transparently with the most difficult part, which is to avoid tokens being split between two chunks. However, the nature of the SEC document, which taken as a whole is not very pure HTML, might make it difficult to use standard tools.
Since the part of the document you want to analyse is well past the middle, at least in the example you presented, you won't be able to reduce the download time by much. But that might still be worthwhile.
The basic approach you describe is workable, but you'll need to change it a bit in order to cope with the search strings being split between chunks, as you noted. The basic idea is to append successive chunks until you find the string, rather than just looking at them one at a time.
I'd suggest first identifying the entire document and then deciding whether it's the document you want. That reduces the search issue to a single string, the document terminator (\n</DOCUMENT>\n; the newlines are added to reduce the possibility of false matches).
Here's a very crude implementation, which I suggest you take as an example rather than just copying it into your program. The function docs yields successive complete documents from a URL; the caller can use that to select the one they want. (In the sample code, the first matching document is used, although there are actually two matches in the complete file. If you want all matches, then you will have to read the entire input, in which case you won't have any speed-up at all, although you might still have some savings from not having to parse everything.)
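The implementation itself did not survive the page extraction, so here is a minimal sketch of what such a docs generator could look like. It assumes the requests library, that documents in the filing are delimited by \n</DOCUMENT>\n (the delimiter used in SEC full-text submission files), and a placeholder User-Agent string (SEC.gov rejects requests without one):

```python
import requests

DOC_END = "\n</DOCUMENT>\n"  # delimiter between documents in SEC full-text filings

def docs(url, chunk_size=64 * 1024):
    """Yield successive complete documents from a filing URL, reading the
    response in chunks instead of loading the whole file at once."""
    buffer = ""
    # Placeholder User-Agent; SEC.gov rejects requests without one.
    with requests.get(url, stream=True,
                      headers={"User-Agent": "research-script example@example.com"}) as resp:
        resp.raise_for_status()
        resp.encoding = resp.encoding or "utf-8"
        for chunk in resp.iter_content(chunk_size=chunk_size, decode_unicode=True):
            buffer += chunk
            # The terminator may straddle two chunks; searching the growing
            # buffer rather than each chunk alone handles that case.
            while (end := buffer.find(DOC_END)) != -1:
                yield buffer[:end + len(DOC_END)]
                buffer = buffer[end + len(DOC_END):]

# Stop at the first matching document without reading the rest of the file.
for document in docs("https://www.sec.gov/Archives/edgar/data/320193/0000320193-20-000096.txt"):
    if "<TYPE>10-K" in document:
        print(document[:200])
        break
```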
QUESTION
I'm trying to scrape a bunch of websites. All of them have one particular table with some variations. For example, if you check this URL, it has the attribute value href="#icaec13e17ee4432d9971f5e4b3d32ba1_265" and refers to a tag elsewhere on the page. So I'll only have the attribute value icaec13e17ee4432d9971f5e4b3d32ba1_265. The tag name and the attribute name vary. How can I find the tag given only the attribute value?
...ANSWER
Answered 2021-May-31 at 08:52 You could define a filter function that checks whether an HTML tag has an attribute whose value equals value:
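Since the actual snippet was lost in extraction, here is a minimal sketch of such a filter function with bs4, using a made-up HTML fragment built around the attribute value from the question:

```python
from bs4 import BeautifulSoup

value = "icaec13e17ee4432d9971f5e4b3d32ba1_265"

# Hypothetical fragment: one tag links to the value, another carries it as an id.
html = """
<a href="#icaec13e17ee4432d9971f5e4b3d32ba1_265">link</a>
<span id="icaec13e17ee4432d9971f5e4b3d32ba1_265">target</span>
"""

def has_attr_value(tag):
    # Match any tag, whatever its name, that has some attribute whose
    # value equals the string we are looking for.
    return any(v == value for v in tag.attrs.values())

soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all(has_attr_value):
    print(tag.name, tag.attrs)  # prints only the <span>; the href has a "#" prefix
```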
QUESTION
Link to url I'm working with: https://www.sec.gov/Archives/edgar/data/789019/000106299321002323/0001062993-21-002323.txt
I can access the text/values contained in some tags, but not in others.
Setup (how I got to the BS soup object):
...ANSWER
Answered 2021-Apr-28 at 22:00 You need to use lxml's XML parser.
For HTML:
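The answer's code blocks were stripped, but the gist is the parser argument to BeautifulSoup; a small sketch of the difference, assuming lxml is installed and using a placeholder User-Agent (SEC.gov rejects requests without one):

```python
from bs4 import BeautifulSoup
import requests

url = ("https://www.sec.gov/Archives/edgar/data/789019/"
       "000106299321002323/0001062993-21-002323.txt")
text = requests.get(url, headers={"User-Agent": "research-script example@example.com"}).text

# "lxml" alone selects the HTML parser, which lowercases tag names and can
# mangle namespaced tags, so some of the filing's tags become unreachable.
html_soup = BeautifulSoup(text, "lxml")

# "xml" selects lxml's XML parser, which preserves case and namespaces,
# so searches against the filing's XML tags behave as expected.
xml_soup = BeautifulSoup(text, "xml")
```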
QUESTION
...ANSWER
Answered 2021-Apr-11 at 17:46 You could convert your date column to datetime, and then use pd.Grouper with groupby, as per below:
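The snippet itself is missing above; a minimal sketch of the idea with illustrative column names (not taken from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2021-01-04", "2021-01-20", "2021-02-03", "2021-02-17"],
    "filings": [3, 1, 4, 2],
})

df["date"] = pd.to_datetime(df["date"])  # plain strings won't group by period
monthly = df.groupby(pd.Grouper(key="date", freq="M"))["filings"].sum()
print(monthly)  # one row per month-end
```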
QUESTION
I am using the pandas_read_xml package for reading and processing XML files into a pandas dataframe. The package works absolutely fine for my purposes in the vast majority of cases. However, the dataframe output is off when reading a URL with just a single tag. Let me illustrate this with the following two examples.
...ANSWER
Answered 2021-Apr-07 at 13:55 First of all, thanks for the feedback! I wrote pandas-read-xml because pandas did not have a pd.read_xml() implementation. You (and the rest of us) will be pleased to know that there is a dev version of pandas read_xml which should be coming soon! (https://pandas.pydata.org/docs/dev/reference/api/pandas.read_xml.html)
As for your current conundrum, this is a result of (and one of my many gripes with) the structure of XML. Unlike JSON, where a single element can still be returned inside a list, XML just has the one tag, which is interpreted as a single value rather than a list.
Essentially, if there is only one "row" tag, it is read as a single value rather than a list of rows, so its "column" tags end up being treated as the rows... I'm not making much sense, am I? Let me explain with your examples.
Here is how I suggest you use it:
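The suggested usage did not survive extraction. As a side note, the pd.read_xml the author mentions has since shipped (pandas 1.3+), and it handles the single-row case uniformly; a small sketch with a made-up document:

```python
import io
import pandas as pd  # requires pandas >= 1.3 for read_xml

xml = """<data>
  <row><name>Acme</name><value>1</value></row>
</data>"""

# Even with a single <row> element, read_xml still returns a one-row
# DataFrame, sidestepping the one-tag-versus-list ambiguity described above.
df = pd.read_xml(io.StringIO(xml))
print(df)
```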
QUESTION
I'm using Google Apps Script
...ANSWER
Answered 2021-Apr-04 at 01:58 Although, unfortunately, I cannot replicate your situation, given that the execution fails only once in a while with this response, how about retrying the request as follows?
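The Apps Script snippet was lost in extraction; the same retry-with-backoff idea, sketched in Python for consistency with the other examples on this page (the function name and limits are illustrative):

```python
import time
import requests

def fetch_with_retries(url, attempts=3, backoff=2.0):
    """Retry a flaky request a few times, waiting longer after each failure."""
    for attempt in range(attempts):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(backoff * (attempt + 1))
```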
QUESTION
I have two ArrayLists of ClassRoom objects, and below is the ClassRoom class:
...ANSWER
Answered 2021-Mar-27 at 11:30 You can do it like this:
QUESTION
I'm updating some code that used to use Xml.parse to parse this page: https://www.sec.gov/cgi-bin/browse-edgar?company=&CIK=&type=8-k&owner=exclude&count=100&action=getcurrent
The old code uses Xml to get the table like this:
...ANSWER
Answered 2021-Mar-27 at 01:14 I believe your goal is as follows.
- You want to retrieve the values from entry of the XML data and put them into the spreadsheet using Google Apps Script.
- When I looked at the data from the URL https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent&CIK=&type=8-k&company=&dateb=&owner=include&start=0&count=40&output=atom, I confirmed that it is XML data.
- When I looked at your script, it seems that entry is not retrieved.
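The answer's Apps Script code is elided above, but the crux, reading the feed's namespaced entry elements, can be sketched in Python (kept in this page's dominant language; the namespace URI is the standard Atom one, and the User-Agent is a placeholder):

```python
import requests
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
url = ("https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent&CIK="
       "&type=8-k&company=&dateb=&owner=include&start=0&count=40&output=atom")

feed = ET.fromstring(requests.get(
    url, headers={"User-Agent": "research-script example@example.com"}).content)

# Atom elements live in a namespace, which is why an unqualified
# find("entry") silently returns nothing.
for entry in feed.findall(f"{ATOM}entry"):
    print(entry.findtext(f"{ATOM}updated"), entry.findtext(f"{ATOM}title"))
```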
QUESTION
I am using an R package, edgarWebR, to parse SEC filings, such as https://www.sec.gov/Archives/edgar/data/1060224/000090480206000008/sa10k306.htm. It returns a dataframe, of which one column, called "raw", is HTML. It breaks the HTML page up into paragraphs, one row per paragraph:

other columns | raw (the paragraph as HTML) | text (the same paragraph as plain text)
First row | We had a net loss of $1.55 million for the year ended December 31, 2016 and have an accumulated deficit of $61.5 million as of December 31, 2016. To achieve sustainable profitability, we must generate increased revenue.
Second row | We have a history of losses, and we cannot assure you that we will achieve profitability.
You can easily replicate an example dataframe by running
...ANSWER
Answered 2021-Mar-23 at 20:30 I've read your related questions here on SO. Interesting work! I believe the solution is somewhere along the lines of:
1: Extract the relevant words from the HTML by doing what you're already doing
QUESTION
I am unable to load the Groceries data set in R.
Can anyone help?
...ANSWER
Answered 2021-Mar-18 at 10:25 Groceries is in the arules package.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported