URLExtract | Python class for collecting URLs | Scraper library

 by lipoja | Python Version: 1.9.0 | License: MIT

kandi X-RAY | URLExtract Summary

URLExtract is a Python library typically used in Automation and Scraper applications. It has no reported bugs or vulnerabilities, a permissive license, a build file available, and low support activity. You can install it with 'pip install urlextract' or download it from GitHub or PyPI.

URLExtract is a Python class for collecting (extracting) URLs from given text based on locating TLDs.
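For example, a minimal usage sketch (find_urls is the extraction method also shown in the community snippet further down this page):

    from urlextract import URLExtract

    extractor = URLExtract()
    # Returns a list of URL-like strings found in the text
    urls = extractor.find_urls("Let's have URL janlipovsky.cz as an example.")
    print(urls)  # -> ['janlipovsky.cz']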

            Support

              URLExtract has a low-activity ecosystem.
              It has 208 stars, 60 forks, and 7 watchers.
              There was 1 major release in the last 12 months.
              There are 20 open issues and 62 closed ones; on average, issues are closed in 116 days. There is 1 open pull request and 0 closed pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of URLExtract is 1.9.0.

            Quality

              URLExtract has 0 bugs and 0 code smells.

            Security

              URLExtract has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              URLExtract code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            License

              URLExtract is licensed under the MIT License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            Reuse

              URLExtract releases are available to install and integrate.
              A deployable package is available on PyPI.
              A build file is available, so you can build the component from source.
              It has 1121 lines of code, 76 functions, and 17 files.
              It has high code complexity; code complexity directly impacts the maintainability of the code.

            Top functions reviewed by kandi - BETA

             kandi has reviewed URLExtract and identified the functions below as its top functions. This is intended to give you an instant insight into the functionality URLExtract implements, and to help you decide if it suits your requirements.
             • Return command-line arguments for URLs
             • Extract URLs from given text (see the sketch below)
             • Generate URLs from a given string
             • Complete a URL
             • Return the default cache file path
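             As a sketch of how the extraction functions above might be used (gen_urls is assumed here to be the generator counterpart of find_urls, yielding matches one at a time):

                 from urlextract import URLExtract

                 extractor = URLExtract()
                 text = "See github.com/lipoja/URLExtract and pypi.org for details."
                 # Iterate lazily instead of building the full list up front
                 for url in extractor.gen_urls(text):
                     print(url)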

            URLExtract Key Features

            No Key Features are available at this moment for URLExtract.

            URLExtract Examples and Code Snippets

            No Code Snippets are available at this moment for URLExtract.

            Community Discussions

            QUESTION

            How do I append a dataframe to another dataframe by passing it to a function, and returning a dataframe containing both?
            Asked 2021-Oct-01 at 22:53
             import pathlib

             import pandas as pd
             import requests
             import win32com.client as win32  # pywin32, used to re-save the broken xls files via Excel
             from requests.structures import CaseInsensitiveDict
             from urlextract import URLExtract

             BASE_DIR = "C:\\Users\\blah\\Desktop\\blah\\blah\\"
            url = "https://otexa.trade.gov/scripts/tqmon2.exe/catdata"
            url2 = "https://otexa.trade.gov/scripts/tqmon2.exe/htsdata"
            #Define header information for POST call to url
            headers = CaseInsensitiveDict()
            headers["Accept"] = "application/json"
            headers["Content-Type"] = "application/json"
            
            cats = ['340','341']
            #cats = ['340','341','640','641','347','348','647','648','338','339','638','639','345','645','445','634','635','352','652','349','649','1','31','61']
            
            cats2 = ['6203424511','6204628011','6215100040','6215200000','4202221500']
            
            
            dictionary_of_different_sites_to_scrape = {'catdata': {
                'url': "https://otexa.trade.gov/scripts/tqmon2.exe/catdata",
                'cat': '340',
                'delimit': 'ssinvert',
                'years': ['2020','2021']
                },
                'htsdata': {
                'url': "https://otexa.trade.gov/scripts/tqmon2.exe/htsdata",
                'cat': '6203424511',
                'delimit': 'ss',
                'years': ['2020', '2021']
            }
            }
            
            
            
            def post_request(param_info): 
                headers = CaseInsensitiveDict()
                headers["Accept"] = "application/json"
                headers["Content-Type"] = "application/json"
                 # Use the site-specific URL from param_info rather than the module-level url
                 resp = requests.post(param_info['url'], data=param_info, headers=headers)  # Send the POST request
                #print(resp.text) #Test the function's output
                return resp.text
            
            def url_stripper(text):
                extractor = URLExtract() #Initialize a URL extractor
                urls = extractor.find_urls(text) #Get that URL!
                link = urls[0]
                #print(link)
                return link
            
             def filename_splitter(link):
                 # rsplit always returns at least one element, so the last piece
                 # is the filename portion after the final '/'
                 filename = link.rsplit('/', 1)[-1]  # Get filename from URL
                 return filename
                
            def save_xls_downloads(filename, link):
                # Sends GET request to url for xls download
                r = requests.get(link, allow_redirects=True)
                # Writes the content of the xls to the local drive
                 with open(filename, 'wb') as file:
                     file.write(r.content)
                print(filename, "written to folder.")
                return filename
            
            def fix_downloaded_xls(filename, code):
                #This section is required due to Otexa having codec issues in the xls files
                excel = win32.gencache.EnsureDispatch(
                    'Excel.Application')  # Initializes an instance of Excel
                # Tell excel to open the file
                wb = excel.Workbooks.Open(BASE_DIR + filename)
            
                 wb.SaveAs(BASE_DIR + "category - " + code + ".xlsx",
                           FileFormat=51)  # 51 = .xlsx; changes the file from xls to xlsx
            
                wb.Close()  # Closes the workbook
                excel.Application.Quit()  # Closes the instance of Excel
                return print("Converted", filename, "to category -",code,".xlsx")
            
            def delete_old_xls(filename):
                #Remove old xls file
                remove_old_file = pathlib.Path(BASE_DIR + filename)
                remove_old_file.unlink()
                return print("Deleted", filename)
            
            def clean_new_xlsx(code):
                # Takes new xlsx file, loads as dataframe, skips the first 4 rows, and assigns the new first row as header
                if len(code) > 3:
                    rows_to_skip = 5
                else:
                    rows_to_skip = 4
                 # Read from BASE_DIR, where fix_downloaded_xls saved the converted file
                 output = pd.read_excel(BASE_DIR + "category - " + code + ".xlsx", skiprows=rows_to_skip, header=0)
                # Tons of NAN values, so replacing with zeroes
                output.fillna(0, inplace=True)
                #print(output.head(2))
                #final_table = final_table.append(output)
                print("Category -", code, ".xlsx has been cleaned and made into dataframe")
                return output
            
            def append_to_report_table(cleaned_dataframe, report_table):
                #print(cleaned_dataframe.head())
                report = report_table.append(cleaned_dataframe, ignore_index=True)
                report_table = report
                #report_table = report_table.rename(columns=report_table.iloc[0])
                #print(report_table.shape)
                #print(report_table.head())
                print("DataFrame appended to report_table.")
                print(report_table.shape)
                return report_table
            
            def save_report_table(site_id, report):
                print(report.shape)
                filename = "C:\\Users\\blah\\Desktop\\blah\\blah\\complete report.xlsx"
                #print(report.info())
                #print(report.head())
                # Drop any rows that the country is equal to 0, which means they were BLANK
                report.drop(report[report['Country'] == 0].index, inplace=True)
                report = report.reset_index()  # Resets the index to get a true count
                report.to_excel(filename)
                print("End")
                return print("Saved complete", site_id, "report.")
            
             for site_id, param_info in dictionary_of_different_sites_to_scrape.items():

                 if len(param_info['cat']) > 3:
                     hts_report = pd.DataFrame()
                     for cat in cats2:
                         param_info['cat'] = cat
                         text = post_request(param_info)
                         link = url_stripper(text)
                         filename = filename_splitter(link)
                         save_xls = save_xls_downloads(filename, link)
                         fix_downloaded_xls(save_xls, param_info['cat'])
                         delete_old_xls(filename)
                         clean_xlsx = clean_new_xlsx(param_info['cat'])
                         # Reassign so the appended result is kept across iterations
                         hts_report = append_to_report_table(clean_xlsx, hts_report)
                     save_report_table(site_id, hts_report)
                 else:
                     report_table = pd.DataFrame()
                     for cat in cats:
                         param_info['cat'] = cat
                         text = post_request(param_info)
                         link = url_stripper(text)
                         filename = filename_splitter(link)
                         save_xls = save_xls_downloads(filename, link)
                         fix_downloaded_xls(save_xls, param_info['cat'])
                         delete_old_xls(filename)
                         clean_xlsx = clean_new_xlsx(param_info['cat'])
                         # Reassign so the appended result is kept across iterations
                         report_table = append_to_report_table(clean_xlsx, report_table)
                     save_report_table(site_id, report_table)
            
            ...

            ANSWER

            Answered 2021-Oct-01 at 22:53

            The answer to your question is pretty simple.

             If you have two datasets, "A" and "B", you can append "A" to "B" or "B" to "A" with code along these lines:
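             The original snippet did not survive extraction; a minimal sketch of the idea, assuming pandas DataFrames named A and B (the names are illustrative):

                 import pandas as pd

                 A = pd.DataFrame({"x": [1, 2]})
                 B = pd.DataFrame({"x": [3, 4]})

                 # Both DataFrame.append (used in the question) and pd.concat return a
                 # NEW DataFrame -- neither modifies its inputs in place -- so the result
                 # must be assigned back, e.g. report_table = report_table.append(df).
                 combined = pd.concat([B, A], ignore_index=True)
                 print(combined)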

            Source https://stackoverflow.com/questions/69412555

             Community Discussions and Code Snippets include sources from the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install URLExtract

            You can install using 'pip install URLExtract' or download it from GitHub, PyPI.
             You can use URLExtract like any standard Python library. Make sure you have a development environment consisting of a Python distribution (including header files), a compiler, pip, and git, and that your pip, setuptools, and wheel are up to date. When using pip, it is generally recommended to install packages in a virtual environment to avoid changes to the system.
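             A quick post-install smoke test (a sketch; has_urls is assumed here as a companion to the find_urls method shown above):

                 # Constructing the extractor loads (and, if needed, downloads) the cached TLD list
                 from urlextract import URLExtract

                 extractor = URLExtract()
                 print(extractor.has_urls("Visit pypi.org for the package"))  # -> True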

            Support

             For any new features, suggestions, and bugs, create an issue on GitHub. If you have any questions, check and ask on the Stack Overflow community page.

            Install
          • PyPI

            pip install urlextract

          • Clone (HTTPS)

            https://github.com/lipoja/URLExtract.git

          • GitHub CLI

            gh repo clone lipoja/URLExtract

          • Clone (SSH)

            git@github.com:lipoja/URLExtract.git
