Accessing nested elements with BeautifulSoup

Question

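I want to find all of the <li> elements nested inside the <ol> element. I have tried the following approaches, but each of them returns 0 messages: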

messages = soup.find_all("ol")
messages = soup.find_all('div', class_='messageContent')
messages = soup.find_all("li")
messages = soup.select('ol > li')
messages = soup.select('.messageList > li')

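The full HTML can be viewed in this gist.

I have two questions:

What is the correct way to get these list items?
In BeautifulSoup, do I have to know the nesting path of an element in order to reach it? Or should something like soup.find_all("li") return every <li> element, whether or not it is nested?

I am also open to solutions that do not use bs4.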

Update:
This is how I load the file:

from bs4 import BeautifulSoup

# Read the HTML content
with open('/tmp/property.html', 'r', encoding='utf-8') as file:
    html_content = file.read()

# Create the BeautifulSoup object and specify the parser
soup = BeautifulSoup(html_content, 'html.parser')

The file contents are the same as in the gist linked above.

Update 2:
I solved the problem by using the requests library. It looks like downloading the file manually may have broken part of the HTML structure?

import requests
from bs4 import BeautifulSoup

url = "https://www.propertychat.com.au/community/threads/melbourne-property-market-2024.75213/"
response = requests.get(url)

soup = BeautifulSoup(response.text, "html.parser")
messages = soup.select('.messageList > li')
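One way to check that suspicion is to compare what the raw text contains with what the parser produced. The following is only a minimal sketch: the helper name summarize and both sample strings are hypothetical, and the escaped sample is just one assumed way a saved file could lose its markup, not a confirmed cause.

```python
import re

from bs4 import BeautifulSoup


def summarize(html: str) -> dict:
    """Compare what the raw text contains with what the parser found."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        # Opening <li> tags counted directly in the text, no parsing involved.
        "raw_li_tags": len(re.findall(r"<li\b", html, flags=re.IGNORECASE)),
        # <li> elements the parser actually produced.
        "parsed_li": len(soup.find_all("li")),
        # Whether the container the selectors rely on exists at all.
        "has_message_list": soup.select_one(".messageList") is not None,
    }


# Hypothetical inputs standing in for the live response and a damaged file.
intact = '<ol class="messageList"><li>a</li><li>b</li></ol>'
escaped = '&lt;ol class="messageList"&gt;&lt;li&gt;a&lt;/li&gt;&lt;/ol&gt;'

intact_summary = summarize(intact)
escaped_summary = summarize(escaped)
print(intact_summary)
print(escaped_summary)
```

If raw_li_tags is already 0 for the saved file, the markup never made it into the file, so no selector could match it.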

arshajii · Answer

Thank you for providing sample code and data.

This code selects the list item elements you are after:

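from pathlib import Path

from bs4 import BeautifulSoup


def parse(in_file: Path) -> None:
    hrule = "=" * 60
    soup = BeautifulSoup(in_file.read_text(), "html.parser")

    # Locate the <ol> that holds the messages, then walk its <li> descendants.
    ol = soup.find("ol", {"class": "messageList", "id": "messageList"})

    for li in ol.find_all("li"):
        print(f"\n\n{hrule}\n\n{li}")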
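Note that soup.find returns None when nothing matches, so a lookup like the one above would raise AttributeError on a file where the <ol> is missing. A guarded variant might look like the sketch below; the function name list_items and the two sample strings are hypothetical and stand in for the real page.

```python
from bs4 import BeautifulSoup


def list_items(html: str, list_id: str = "messageList") -> list:
    """Return the <li> elements under the <ol> with the given id, or [] if absent."""
    soup = BeautifulSoup(html, "html.parser")
    ol = soup.find("ol", id=list_id)
    if ol is None:
        # find() returns None when nothing matches; calling find_all on it
        # would raise AttributeError.
        return []
    return ol.find_all("li")


# Hypothetical inputs: one with the container, one without.
with_list = '<ol id="messageList"><li>one</li><li>two</li></ol>'
without_list = "<div>no list here</div>"

found = list_items(with_list)
missing = list_items(without_list)
print(len(found), len(missing))
```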

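You can certainly use a call like soup.find_all("li") to find every <li> tag.
That retrieves all of the list items in the document,
even those nested inside other elements.

At first I planned to loop over each of the containing elements.

I usually write nested loops that mirror the nesting of the document's elements,
but that is not required. It just makes the results easier to interpret,
because you know which container each element came from, which gives you context.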
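To make the difference concrete, here is a small sketch on a made-up page. SAMPLE_HTML is hypothetical and far smaller than the real thread, so the counts only illustrate how the search scope changes.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page; the real thread markup is much larger.
SAMPLE_HTML = """
<div class="pageContent">
  <ol class="messageList" id="messageList">
    <li id="post-1" class="message">First post
      <ul><li>quoted item</li></ul>
    </li>
    <li id="post-2" class="message">Second post</li>
  </ol>
  <ul class="nav"><li>Home</li></ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all searches every descendant, so nesting depth does not matter.
all_items = soup.find_all("li")

# Searching from one container keeps only that container's descendants.
message_list = soup.find("ol", id="messageList")
items_in_list = message_list.find_all("li")

# recursive=False (or the CSS child combinator ">") keeps direct children only.
direct_items = message_list.find_all("li", recursive=False)
direct_via_css = soup.select("ol.messageList > li")

print(len(all_items), len(items_in_list), len(direct_items), len(direct_via_css))
```

Starting the search from the container trades a little extra code for knowing where each item came from.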

Joe · Answer

Maybe this is what you need?

import requests as r
from bs4 import BeautifulSoup as bs

URL = "https://www.propertychat.com.au/community/threads/melbourne-property-market-2024.75213/"
page = r.get(URL)

soup_obj = bs(page.content, "html.parser")

results_object = soup_obj.find("ol")

# find_all already returns a list-like ResultSet, so no extra brackets are needed
li_list = results_object.find_all("li")

print(li_list)

This code uses requests and bs4 to find the <ol> element you mentioned, stores its <li> elements in a list object named li_list, and finally prints the contents of li_list.
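Since the question is also open to approaches without bs4, the same idea can be sketched with the standard library's html.parser module. This is not taken from either answer. The class name MessageListParser and the sample markup are hypothetical, and the tag tracking is deliberately simple, so treat it as a starting point rather than a robust parser.

```python
from html.parser import HTMLParser


class MessageListParser(HTMLParser):
    """Collects the text of each top-level <li> inside <ol class="messageList">."""

    def __init__(self) -> None:
        super().__init__()
        self.items = []    # text of each top-level list item
        self._stack = []   # open tags, starting at the target <ol>
        self._buffer = []  # text pieces of the item being read

    def handle_starttag(self, tag, attrs):
        if self._stack:
            # A <li> directly under the target <ol> starts a new item.
            if tag == "li" and len(self._stack) == 1:
                self._buffer = []
            self._stack.append(tag)
        elif tag == "ol" and "messageList" in (dict(attrs).get("class") or "").split():
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self._stack:
            return
        # Pop back to the matching open tag; this also discards void
        # elements such as <br> that never receive an end tag.
        last = len(self._stack) - 1 - self._stack[::-1].index(tag)
        del self._stack[last:]
        if tag == "li" and len(self._stack) == 1:
            self.items.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        # Only keep text that sits inside an item of the target list.
        if len(self._stack) >= 2:
            self._buffer.append(data)


# Hypothetical miniature page standing in for the real thread.
SAMPLE_HTML = """
<ol class="messageList" id="messageList">
  <li id="post-1">First post<br>
    <ul><li>quoted item</li></ul>
  </li>
  <li id="post-2">Second post</li>
</ol>
<ul><li>Home</li></ul>
"""

parser = MessageListParser()
parser.feed(SAMPLE_HTML)
print(parser.items)
```

Text from nested lists is folded into the enclosing top-level item, and items outside the target <ol> are ignored.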