I am new to scraping and I made this
    import bs4, requests

    def get_beautiful_soup(url):
        # name the parser explicitly so bs4 does not have to guess one
        return bs4.BeautifulSoup(requests.get(url).text, 'html.parser')

    soup = get_beautiful_soup('http://www.python.org')
    print(soup.prettify())
Anything else that I can do with this?
Print all http and https links on the page sorted alphabetically with no duplicates...
    links = []
    for anchor in soup.find_all('a'):  # tags like: <a href='http...'
        try:
            link = anchor['href']
            if link.startswith('http'):
                links.append(link)
        except KeyError:  # anchor tag without an href attribute
            pass
    print('\n'.join(sorted(set(links))))
The bs4 documentation examples are quite fun to read and run in Pythonista.
Another example... We just want the text of the webpage without all the HTML tags, etc.
    print('=' * 40)
    # print the text of the body of the webpage without all the html junk
    print(soup.body.get_text())  # get the body of the soup, and then get only the text of that

    print('=' * 40)
    # contains lots of blank lines... let's get rid of the blank lines
    for line in soup.body.get_text().splitlines():
        if line.strip():
            print(line)

    print('=' * 40)
    # another way to write the three previous lines uses a list comprehension and str.join()
    print('\n'.join([x for x in soup.body.get_text().splitlines() if x.strip()]))

    print('=' * 40)
    # contains lots of lines that have indentation... left justify all lines
    for line in soup.body.get_text().splitlines():
        if line.strip():
            print(line.lstrip())

    print('=' * 40)
    # rewritten using a list comprehension and str.join()
    print('\n'.join([x.lstrip() for x in soup.body.get_text().splitlines() if x.strip()]))
Does someone else want to do the next one?
Print out the URLs for all the images that appear on the page.
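One possible take on this exercise, as a minimal sketch rather than the only way to do it. It assumes Python 3 (on Python 2, urljoin lives in the urlparse module instead) and works on a made-up HTML snippet so it runs without a network connection. The names image_urls and SAMPLE_HTML are hypothetical, not from the posts above.

```python
import bs4
from urllib.parse import urljoin

def image_urls(html, base_url):
    """Return the sorted, de-duplicated absolute URLs of all <img> tags."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    urls = set()
    for img in soup.find_all('img'):
        src = img.get('src')  # None if the tag has no src attribute
        if src:
            # relative paths such as /static/logo.png become absolute URLs
            urls.add(urljoin(base_url, src))
    return sorted(urls)

# Made-up page content, for illustration only
SAMPLE_HTML = '''
<html><body>
<img src="/static/logo.png">
<img src="http://example.com/banner.jpg">
<img alt="no source here">
<img src="/static/logo.png">
</body></html>
'''

for url in image_urls(SAMPLE_HTML, 'http://example.com/index.html'):
    print(url)
```

To try it on a live page, pass requests.get(url).text as html and the same url as base_url.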
I have made something new! I will post it soon
    import bs4, requests
    import webbrowser
    import console  # Pythonista-only module

    def get_beautiful_soup(url):
        return bs4.BeautifulSoup(requests.get(url).text, 'html.parser')

    # raw_input() is Python 2; on Python 3 use input() instead
    a = raw_input('url to check. url structure (http://www.url.com or net or gov or org) ')
    console.clear()
    soup = get_beautiful_soup(a)
    webbrowser.open('http://google.com/gmail')
    print(soup.prettify())
JonB:
Might be a slightly more robust way to get the filename. Technically the URL could contain a query string, a fragment, etc., which urlsplit separates from the path.
dgelessus:
No, bad! Don't use the posix module directly. It's an undocumented internal module that provides implementations for some of the os module's functions on Unix-based systems. In almost all cases you'll want to use os, which provides all the functions that posix does, but is guaranteed to be available on all platforms.
What else can I do with bs4? I only know how to get HTML.
So far we have demonstrated how to get:
- The body text only with no markup
- The URLs that are linked to
- All the images, downloaded
What else do web pages contain? Sounds, Movies, Forms, Lists, Files, others?
What else are you interested to get from web pages? Music lists? Tour schedules for your favorite band? Local weather forecast? Snow depth at local ski slopes? Wave heights at various beaches? What info does reefboy1 want to scrape off of webpages?
Have you gone thru the bs4 documentation examples yet?
Yes, I have read them, but I'm not sure I fully understand them. A weather forecast would be nice.
PS: I don't know HTML.
wradcliffe:
The current examples have been great for picking out generic things on web pages but the typical use case needs to understand the page format and then get something specific off the page like a table of data. You typically need to use the pretty print facility to view the page contents and then issue a series of tedious calls to march down through the structure and pull out the text. The existing bs4 docs and examples are fine for that.
You can get real far with a very specific problem and easily code up something that grabs what you need off a page using these hand-coded methods, but the code is probably going to be throwaway and will break as soon as the target site changes the format of its pages.
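As a rough illustration of that hand-coded approach, here is a minimal sketch that pulls a table of data off a page. The HTML snippet, the table id "forecast", and the numbers in it are all made up for the example, and table_rows is a hypothetical helper name. A real page would need its own selectors, worked out by reading the prettify() output.

```python
import bs4

# Made-up page content; a real site will be structured differently
SAMPLE_HTML = '''
<html><body>
<table id="forecast">
  <tr><th>Day</th><th>High</th><th>Low</th></tr>
  <tr><td>Mon</td><td>21</td><td>12</td></tr>
  <tr><td>Tue</td><td>19</td><td>11</td></tr>
</table>
</body></html>
'''

def table_rows(html, table_id):
    """Return the table with the given id as a list of rows of cell text."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    table = soup.find('table', id=table_id)
    if table is None:  # the page changed, or the id is wrong
        return []
    rows = []
    for tr in table.find_all('tr'):
        # header cells (<th>) and data cells (<td>) are both collected
        cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        if cells:
            rows.append(cells)
    return rows

for row in table_rows(SAMPLE_HTML, 'forecast'):
    print(', '.join(row))
```

Note how much of this depends on the page layout: if the site renames the table id or moves the data into different tags, the function quietly returns an empty list.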
However ... this is NOT what a true "ScreenScraper" app does. True ScreenScrapers build on bs4, provide templates for various kinds of web pages, and automatically strip out all the crap like inline ads and sidebars and all the "visual cruft" while returning only the "content". The best ones can identify the main images that the page is referring to and the body of the text that is the main subject of the page. There is a project/product called "readability" that did this and was ported to Python, but they stopped updating it when it became a "commercial" product. You can still find the early Python code, though, which uses a version of bs, at: https://github.com/gfxmonk/python-readability
The big players in this area are obviously the search engine companies like Google, Bing, Facebook that have extremely sophisticated methods for dissecting web pages and getting at the real "content". The rest of the universe seems to have moved on to using a web API instead, which just hands you the required data using a special URL syntax.
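To give a feel for the web API style, here is a small sketch, assuming Python 3. The endpoint http://api.example.com/forecast, its parameters, and the JSON reply are all invented for illustration; a real service documents its own URL syntax and response format. The point is only that the data arrives already structured, so there is no HTML to pick apart.

```python
import json
from urllib.parse import urlencode

def build_api_url(base, **params):
    """Build a query URL; parameters are sorted so the result is predictable."""
    return base + '?' + urlencode(sorted(params.items()))

# Invented reply, standing in for what requests.get(url).text might return
SAMPLE_RESPONSE = (
    '{"city": "Springfield", "forecast": ['
    '{"day": "Mon", "high": 21}, {"day": "Tue", "high": 19}]}'
)

url = build_api_url('http://api.example.com/forecast',
                    city='Springfield', units='metric')
print(url)

data = json.loads(SAMPLE_RESPONSE)  # plain dicts and lists, no tags to strip
for entry in data['forecast']:
    print(entry['day'], entry['high'])
```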
@JonB: Great, thank you.
JonB:
My bad... that should be os.path.basename instead. I wasn't sure how Windows machines would handle forward slashes in URLs, but a quick test shows it works properly.
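For anyone following along, the idea discussed here can be sketched roughly like this, assuming Python 3 (on Python 2, urlsplit comes from the urlparse module). The function name filename_from_url and the example URL are hypothetical; this is a reconstruction of the idea, not the code originally posted.

```python
import os.path
from urllib.parse import urlsplit

def filename_from_url(url):
    """Return the last path component of a URL, ignoring query and fragment."""
    path = urlsplit(url).path  # e.g. '/images/logo.png', without '?size=large#top'
    return os.path.basename(path)

print(filename_from_url('http://example.com/images/logo.png?size=large#top'))
```

A URL that ends in a slash has no filename, so the function returns an empty string in that case and the caller has to pick a default name.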