Skip to main content

URL

This example covers how to load HTML documents from a list of URLs into the Document format that we can use downstream.

Unstructured URL Loader

You have to install the unstructured library:

!pip install -U unstructured
from langchain_community.document_loaders import UnstructuredURLLoader
API Reference:UnstructuredURLLoader
urls = [
"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023",
]

Pass in ssl_verify=False with headers=headers to get past ssl_verification error.

loader = UnstructuredURLLoader(urls=urls)
data = loader.load()

Selenium URL Loader

This covers how to load HTML documents from a list of URLs using the SeleniumURLLoader.

Using Selenium allows us to load pages that require JavaScript to render.

To use the SeleniumURLLoader, you have to install selenium and unstructured.

!pip install -U selenium unstructured
from langchain_community.document_loaders import SeleniumURLLoader
API Reference:SeleniumURLLoader
urls = [
"https://www.youtube.com/watch?v=dQw4w9WgXcQ",
"https://goo.gl/maps/NDSHwePEyaHMFGwh8",
]
loader = SeleniumURLLoader(urls=urls)
data = loader.load()

Playwright URL Loader

This covers how to load HTML documents from a list of URLs using the PlaywrightURLLoader.

Playwright enables reliable end-to-end testing for modern web apps.

As in the Selenium case, Playwright allows us to load and render the JavaScript pages.

To use the PlaywrightURLLoader, you have to install playwright and unstructured. Additionally, you have to install the Playwright Chromium browser:

!pip install -U playwright unstructured
!playwright install
from langchain_community.document_loaders import PlaywrightURLLoader
API Reference:PlaywrightURLLoader
urls = [
"https://www.youtube.com/watch?v=dQw4w9WgXcQ",
"https://goo.gl/maps/NDSHwePEyaHMFGwh8",
]
loader = PlaywrightURLLoader(urls=urls, remove_selectors=["header", "footer"])
data = loader.load()

Was this page helpful?


You can leave detailed feedback on GitHub.