This tutorial shows how to get <script> tags and their contents with BeautifulSoup.
Get all script tags
To get all script tags, use the find_all() method.
Let's look at an example.
from bs4 import BeautifulSoup # Import the BeautifulSoup class
# HTML source
html = '''
<head>
<script src="/static/js/prism.js"></script>
<script src="/static/js/bootstrap.bundle.min.js"></script>
<script src="/static/js/main.js"></script>
<script> console.log('Hellow BeautifulSoup') </script>
</head>
'''
soup = BeautifulSoup(html, 'html.parser') # Parse the HTML
scripts = soup.find_all("script") # Find all script tags
print(scripts) # Print the result
Output:
[<script src="/static/js/prism.js"></script>, <script src="/static/js/bootstrap.bundle.min.js"></script>, <script src="/static/js/main.js"></script>, <script> console.log('Hellow BeautifulSoup') </script>]
As you can see, find_all() returns the script tags as a list-like ResultSet. Now let's print them one by one.
for script in scripts: # Loop over the script tags
    print(script)
Output:
<script src="/static/js/prism.js"></script>
<script src="/static/js/bootstrap.bundle.min.js"></script>
<script src="/static/js/main.js"></script>
<script> console.log('Hellow BeautifulSoup') </script>
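Because the result behaves like a list, the usual list operations apply to it. The following is a minimal sketch rather than part of the original example: it uses a shortened copy of the HTML above, and the variable names (count, first, same_first, limited) are chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample HTML used in this tutorial
html = '''
<head>
<script src="/static/js/prism.js"></script>
<script src="/static/js/main.js"></script>
<script> console.log('Hellow BeautifulSoup') </script>
</head>
'''

soup = BeautifulSoup(html, "html.parser")
scripts = soup.find_all("script")

count = len(scripts)                         # number of matching tags
first = scripts[0]                           # indexing works as on a list
same_first = soup.find("script")             # find() returns only the first match (or None)
limited = soup.find_all("script", limit=2)   # stop after two matches

print(count)
print(first)
print(len(limited))
```

If only the first script tag is needed, find() avoids building the full result list.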
Get script tags that load a script file
To get only the script tags that load a script file, we need to:
- Use the find_all() method
- Pass the src=True argument
Example:
# HTML source
html = '''
<head>
<script src="/static/js/prism.js"></script>
<script src="/static/js/bootstrap.bundle.min.js"></script>
<script src="/static/js/main.js"></script>
<script> console.log('Hellow BeautifulSoup') </script>
</head>
'''
soup = BeautifulSoup(html, 'html.parser') # Parse the HTML
scripts = soup.find_all("script", src=True) # Find all script tags that have a src attribute
print(scripts) # Print the result
Output:
[<script src="/static/js/prism.js"></script>, <script src="/static/js/bootstrap.bundle.min.js"></script>, <script src="/static/js/main.js"></script>]
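The src filter accepts more than True. As a hedged sketch, assuming a compiled regular expression and a CSS selector are acceptable alternatives for your use case, the same kind of lookup can be narrowed or rewritten as shown here. The variable names and the ".min.js" pattern are illustrative and not part of the original example.

```python
import re
from bs4 import BeautifulSoup

html = '''
<head>
<script src="/static/js/prism.js"></script>
<script src="/static/js/bootstrap.bundle.min.js"></script>
<script src="/static/js/main.js"></script>
<script> console.log('Hellow BeautifulSoup') </script>
</head>
'''

soup = BeautifulSoup(html, "html.parser")

# src=True: any script tag that has a src attribute, whatever its value
with_src = soup.find_all("script", src=True)

# A compiled regex keeps only tags whose src value matches the pattern
minified = soup.find_all("script", src=re.compile(r"\.min\.js$"))

# CSS attribute selector: an alternative way to express "has a src attribute"
via_css = soup.select("script[src]")

print(len(with_src))
print([tag["src"] for tag in minified])
print(len(via_css))
```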
To get the src attribute of each script tag, use the following code.
# Get the src attribute
for script in scripts: # Loop over the script tags
    print(script['src'])
Output:
/static/js/prism.js
/static/js/bootstrap.bundle.min.js
/static/js/main.js
As you can see, we use ['src'] to get the src URL of each script tag.
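Subscript access raises a KeyError when a tag has no src attribute, which happens as soon as the list also contains inline scripts. The sketch below uses the tag's get() method, which returns None instead, and resolves the relative paths with urljoin from the standard library. BASE_URL is a hypothetical page address assumed for illustration only.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical address of the page the HTML came from (an assumption)
BASE_URL = "https://example.com/blog/"

html = '''
<head>
<script src="/static/js/prism.js"></script>
<script src="/static/js/main.js"></script>
<script> console.log('Hellow BeautifulSoup') </script>
</head>
'''

soup = BeautifulSoup(html, "html.parser")

absolute_urls = []
for script in soup.find_all("script"):
    src = script.get("src")   # None for inline scripts, no KeyError
    if src is None:
        continue
    # Resolve the relative path against the page address
    absolute_urls.append(urljoin(BASE_URL, src))

print(absolute_urls)
```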
Get the content of script tags
To get the content of a script tag, use the .string attribute. Let's look at an example:
# HTML source
html = '''
<head>
<script src="/static/js/prism.js"></script>
<script src="/static/js/bootstrap.bundle.min.js"></script>
<script src="/static/js/main.js"></script>
<script> console.log('Hellow BeautifulSoup') </script>
</head>
'''
soup = BeautifulSoup(html, 'html.parser') # Parse the HTML
scripts = soup.find_all("script", string=True) # Find script tags that have string content
print(scripts) # Print the result
Output:
[<script> console.log('Hellow BeautifulSoup') </script>]
We set string=True to find all script tags that have content. Now let's print the content of those script tags.
# Get the content of each script tag
for script in scripts: # Loop over the script tags
    print(script.string)
Output:
console.log('Hellow BeautifulSoup')
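For a script tag with no content, such as one that only has a src attribute, .string is None. The sketch below is one possible way to collect inline code while skipping those tags. It starts from an unfiltered find_all() call, and the inline_code list and the strip() call that trims surrounding whitespace are choices made here for illustration.

```python
from bs4 import BeautifulSoup

html = '''
<head>
<script src="/static/js/main.js"></script>
<script> console.log('Hellow BeautifulSoup') </script>
</head>
'''

soup = BeautifulSoup(html, "html.parser")

inline_code = []
for script in soup.find_all("script"):
    if script.string is None:   # empty tag, e.g. an external script
        continue
    # strip() removes the whitespace around the JavaScript source
    inline_code.append(script.string.strip())

print(inline_code)
```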
Conclusion
In this BeautifulSoup tutorial, we learned how to get all script tags. We also learned how to get the src attribute and the content of script tags.