URLExtract | Python class for collecting URLs | Scraper library

 by lipoja | Python | Version: 1.9.0 | License: MIT

kandi X-RAY | URLExtract Summary

URLExtract is a Python library typically used in automation and scraper applications. URLExtract has no reported bugs or vulnerabilities, has a build file available, carries a permissive license, and has low support. You can install it with 'pip install URLExtract' or download it from GitHub or PyPI.

URLExtract is a Python class for collecting (extracting) URLs from a given text by locating TLDs.
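A minimal usage sketch (the sample string is illustrative; find_urls is the same extraction entry point used in the community example further below):

    from urlextract import URLExtract

    extractor = URLExtract()  # loads/creates the cached TLD list on first use
    urls = extractor.find_urls("Let's have URL janlipovsky.cz as an example.")
    print(urls)  # ['janlipovsky.cz']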

            Support

              URLExtract has a low-activity ecosystem.
              It has 208 stars, 60 forks, and 7 watchers.
              There was 1 major release in the last 6 months.
              There are 20 open issues and 62 closed issues. On average, issues are closed in 116 days. There is 1 open pull request and 0 closed pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of URLExtract is 1.9.0.

            Quality

              URLExtract has 0 bugs and 0 code smells.

            Security

              URLExtract has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              URLExtract code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            License

              URLExtract is licensed under the MIT License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            Reuse

              URLExtract releases are available to install and integrate.
              A deployable package is available on PyPI.
              A build file is available, so you can build the component from source.
              It has 1,121 lines of code, 76 functions, and 17 files.
              It has high code complexity, which directly impacts maintainability.

            Top functions reviewed by kandi - BETA

            kandi has reviewed URLExtract and identified the functions below as its top functions. This is intended to give you an instant insight into URLExtract's implemented functionality and help you decide whether it suits your requirements.
            • Return command-line arguments for URLs
            • Extract URLs from given text
            • Generates URLs from a given string
            • Complete URL
            • Returns the default cache file path
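            Beyond find_urls, the function list above points at a generator-based API. A short sketch, assuming the gen_urls generator and the update_when_older cache helper documented in the project README:

                from urlextract import URLExtract

                extractor = URLExtract()

                # Refresh the cached IANA TLD list only if it is older than 7 days
                extractor.update_when_older(7)

                text = "Docs at janlipovsky.cz, source at github.com/lipoja/URLExtract."

                # gen_urls yields matches lazily, avoiding a full list for large inputs
                for url in extractor.gen_urls(text):
                    print(url)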

            URLExtract Key Features

            No Key Features are available at this moment for URLExtract.

            URLExtract Examples and Code Snippets

            No Code Snippets are available at this moment for URLExtract.

            Community Discussions

            QUESTION

            How do I append a dataframe to another dataframe by passing it to a function, and returning a dataframe containing both?
            Asked 2021-Oct-01 at 22:53
             # Imports reconstructed so the snippet runs as posted (not shown in the original question)
             import pathlib
             import pandas as pd
             import requests
             from requests.structures import CaseInsensitiveDict
             from urlextract import URLExtract
             import win32com.client as win32  # pywin32; drives Excel for the xls -> xlsx conversion

             BASE_DIR = "C:\\Users\\blah\\Desktop\\blah\\blah\\"
            url = "https://otexa.trade.gov/scripts/tqmon2.exe/catdata"
            url2 = "https://otexa.trade.gov/scripts/tqmon2.exe/htsdata"
            #Define header information for POST call to url
            headers = CaseInsensitiveDict()
            headers["Accept"] = "application/json"
            headers["Content-Type"] = "application/json"
            
            cats = ['340','341']
            #cats = ['340','341','640','641','347','348','647','648','338','339','638','639','345','645','445','634','635','352','652','349','649','1','31','61']
            
            cats2 = ['6203424511','6204628011','6215100040','6215200000','4202221500']
            
            
            dictionary_of_different_sites_to_scrape = {'catdata': {
                'url': "https://otexa.trade.gov/scripts/tqmon2.exe/catdata",
                'cat': '340',
                'delimit': 'ssinvert',
                'years': ['2020','2021']
                },
                'htsdata': {
                'url': "https://otexa.trade.gov/scripts/tqmon2.exe/htsdata",
                'cat': '6203424511',
                'delimit': 'ss',
                'years': ['2020', '2021']
            }
            }
            
            
            
            def post_request(param_info): 
                headers = CaseInsensitiveDict()
                headers["Accept"] = "application/json"
                headers["Content-Type"] = "application/json"
                 # Use the site-specific URL; the original read the global 'url', which always hit catdata
                 resp = requests.post(param_info['url'], data=param_info, headers=headers)  # Send the POST request
                #print(resp.text) #Test the function's output
                return resp.text
            
            def url_stripper(text):
                extractor = URLExtract() #Initialize a URL extractor
                urls = extractor.find_urls(text) #Get that URL!
                link = urls[0]
                #print(link)
                return link
            
             def filename_splitter(link):
                 # str.find returns an index (-1 when absent), so the original truthiness
                 # test was unreliable; a membership test is the intended check
                 if '/' in link:
                     return link.rsplit('/', 1)[1]  # Get filename from URL
                 return link
                
             def save_xls_downloads(filename, link):
                 # Sends GET request to url for xls download
                 r = requests.get(link, allow_redirects=True)
                 # Writes the content of the xls to the local drive
                 with open(filename, 'wb') as file:
                     file.write(r.content)
                 print(filename, "written to folder.")
                 return filename
            
            def fix_downloaded_xls(filename, code):
                #This section is required due to Otexa having codec issues in the xls files
                excel = win32.gencache.EnsureDispatch(
                    'Excel.Application')  # Initializes an instance of Excel
                # Tell excel to open the file
                wb = excel.Workbooks.Open(BASE_DIR + filename)
            
                wb.SaveAs("C:\\Users\\blah\\Desktop\\blah\\blah\\" +
                          "category - " + code + ".xlsx", FileFormat=51)  # Changes the file from xls to xlsx
            
                 wb.Close()  # Closes the workbook
                 excel.Application.Quit()  # Closes the instance of Excel
                 print("Converted", filename, "to category -", code, ".xlsx")
            
            def delete_old_xls(filename):
                #Remove old xls file
                remove_old_file = pathlib.Path(BASE_DIR + filename)
                remove_old_file.unlink()
                return print("Deleted", filename)
            
            def clean_new_xlsx(code):
                 # Loads the new xlsx as a dataframe, skips the first 4 or 5 rows (5 for long HTS codes), and uses the next row as header
                if len(code) > 3:
                    rows_to_skip = 5
                else:
                    rows_to_skip = 4
                output = pd.read_excel("category - " + code + ".xlsx", skiprows=rows_to_skip, header=0)
                # Tons of NAN values, so replacing with zeroes
                output.fillna(0, inplace=True)
                #print(output.head(2))
                #final_table = final_table.append(output)
                print("Category -", code, ".xlsx has been cleaned and made into dataframe")
                return output
            
            def append_to_report_table(cleaned_dataframe, report_table):
                #print(cleaned_dataframe.head())
                 # NB: DataFrame.append was removed in pandas 2.0; pd.concat is the modern equivalent
                 report = report_table.append(cleaned_dataframe, ignore_index=True)
                 report_table = report
                #report_table = report_table.rename(columns=report_table.iloc[0])
                #print(report_table.shape)
                #print(report_table.head())
                print("DataFrame appended to report_table.")
                print(report_table.shape)
                return report_table
            
            def save_report_table(site_id, report):
                print(report.shape)
                filename = "C:\\Users\\blah\\Desktop\\blah\\blah\\complete report.xlsx"
                #print(report.info())
                #print(report.head())
                # Drop any rows that the country is equal to 0, which means they were BLANK
                report.drop(report[report['Country'] == 0].index, inplace=True)
                report = report.reset_index()  # Resets the index to get a true count
                report.to_excel(filename)
                print("End")
                return print("Saved complete", site_id, "report.")
            
            for site_id, param_info in dictionary_of_different_sites_to_scrape.items():
            
                if len(param_info['cat']) > 3:
                    hts_report = pd.DataFrame()
                    for cat in cats2:
                        param_info['cat'] = cat
                        text = post_request(param_info)
                        link = url_stripper(text)
                        filename = filename_splitter(link)
                        save_xls = save_xls_downloads(filename, link)
                        fix_downloaded_xls(save_xls, param_info['cat'])
                        delete_old_xls(filename)
                        clean_xlsx = clean_new_xlsx(param_info['cat'])
                        append_report = append_to_report_table(clean_xlsx, hts_report)
                    save_report_table(site_id, report_table)
                else:
                    report_table = pd.DataFrame()
                    for cat in cats:
                        param_info['cat'] = cat
                     text = post_request(param_info)
                     link = url_stripper(text)
                        filename = filename_splitter(link)
                        save_xls = save_xls_downloads(filename, link)
                        fix_downloaded_xls(save_xls, param_info['cat'])
                        delete_old_xls(filename)
                        clean_xlsx = clean_new_xlsx(param_info['cat'])
                        append_to_report_table(clean_xlsx, report_table)
                    save_report_table(site_id, report_table)
                
            
            ...

            ANSWER

            Answered 2021-Oct-01 at 22:53

            The answer to your question is pretty simple.

             If you have two dataframes, "A" and "B", you can append "A" to "B" or "B" to "A" with code along the following lines:
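             The code block from the original answer is not shown on this page; a minimal sketch of the idea with two hypothetical frames A and B, using pd.concat (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0):

                 import pandas as pd

                 A = pd.DataFrame({"x": [1, 2]})
                 B = pd.DataFrame({"x": [3, 4]})

                 # Append B to A; ignore_index renumbers the combined rows 0..n-1
                 combined = pd.concat([A, B], ignore_index=True)

             Applied to the question's code: capture the function's return value on each loop iteration, e.g. report_table = append_to_report_table(clean_xlsx, report_table), otherwise the accumulated table is lost between iterations.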

            Source https://stackoverflow.com/questions/69412555

            Community Discussions and Code Snippets include sources from the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install URLExtract

            You can install using 'pip install URLExtract' or download it from GitHub, PyPI.
            You can use URLExtract like any standard Python library. You will need a development environment consisting of a Python distribution (including header files), a compiler, pip, and git. Make sure that your pip, setuptools, and wheel are up to date. When using pip, it is generally recommended to install packages in a virtual environment to avoid changes to the system.
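            A typical install sequence, assuming a Unix-like shell (the activate step differs on Windows):

                python -m venv .venv
                source .venv/bin/activate
                pip install --upgrade pip setuptools wheel
                pip install urlextract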

            Support

            For new features, suggestions, and bugs, create an issue on GitHub. If you have questions, check and ask on the community page on Stack Overflow.
            Install
          • PyPI

            pip install urlextract

          • Clone via HTTPS

            https://github.com/lipoja/URLExtract.git

          • GitHub CLI

            gh repo clone lipoja/URLExtract

          • SSH URL

            git@github.com:lipoja/URLExtract.git
