omz:forum



    Scraping Test

    Pythonista
    • reefboy1

      I am new to scraping and I made this:

      
      import bs4, requests

      def get_beautiful_soup(url):
          # fetch the page and parse it; the explicit parser avoids bs4's "no parser specified" warning
          return bs4.BeautifulSoup(requests.get(url).text, 'html.parser')

      soup = get_beautiful_soup('http://www.python.org')

      print(soup.prettify())  # dump the parsed HTML, nicely indented
      
      

      Anything else that I can do with this?

      • ccc

        Print all http and https links on the page sorted alphabetically with no duplicates...

        links = []
        for anchor in soup.find_all('a'):  # anchor tags like <a href='http...'>
            try:
                link = anchor['href']
                if link.startswith('http'):
                    links.append(link)
            except KeyError:
                pass
        print('\n'.join(sorted(set(links))))
        

        The bs4 documentation examples are quite fun to read and run in Pythonista.

        • ccc

          Another example... We just want the text of the webpage without all the HTML tags, etc.

          print('=' * 40)  # print the text of the body of the webpage without all the HTML junk
          print(soup.body.get_text())  # get the body of the soup, and then get only the text of that
          print('=' * 40)  # contains lots of blank lines... let's get rid of the blank lines
          for line in soup.body.get_text().splitlines():
              if line.strip():
                  print(line)
          print('=' * 40)  # another way to write the three previous lines uses a list comprehension and str.join()
          print('\n'.join([x for x in soup.body.get_text().splitlines() if x.strip()]))
          print('=' * 40)  # contains lots of lines that have indentation... left justify all lines
          for line in soup.body.get_text().splitlines():
              if line.strip():
                  print(line.lstrip())
          print('=' * 40)  # rewritten using a list comprehension and str.join()
          print('\n'.join([x.lstrip() for x in soup.body.get_text().splitlines() if x.strip()]))
          
          • ccc

            Does someone else want to do the next one?

            Print out the URLs for all the images that appear on the page.

            • brumm

              find_all_pictures.py
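
              The attached script itself wasn't captured in this scrape. As a rough sketch of the idea (not brumm's actual code), listing every image URL on the page could look like this, reusing the get_beautiful_soup helper from above:

              # hypothetical reconstruction -- the original attachment wasn't captured
              import bs4, requests

              def get_beautiful_soup(url):
                  return bs4.BeautifulSoup(requests.get(url).text, 'html.parser')

              soup = get_beautiful_soup('http://www.python.org')
              for img in soup.find_all('img'):  # <img src='...'> tags
                  src = img.get('src')          # src may be missing or relative
                  if src:
                      print(src)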

              • ccc

                Really nice, @brumm...

                It works even better if you set the URL to http://amazon.com or http://imdb.com. The Python.org page handles images differently, but I like your solution best.

                • reefboy1

                  I have made something new! I will post it soon.

                  • brumm

                    download_pictures
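
                    Again, the attached script wasn't captured here; the following is only a sketch of downloading every image on a page, not brumm's actual code. Its naive filename handling is what JonB's comment further down improves on:

                    # hypothetical sketch -- the original attachment wasn't captured
                    import bs4, requests

                    base_url = 'http://www.python.org'
                    soup = bs4.BeautifulSoup(requests.get(base_url).text, 'html.parser')
                    for img in soup.find_all('img'):
                        src = img.get('src')
                        if not src:
                            continue
                        # resolve relative src values against the page URL
                        img_url = src if src.startswith('http') else base_url + src
                        filename = img_url.split('/')[-1]  # naive: breaks on query strings
                        with open(filename, 'wb') as f:
                            f.write(requests.get(img_url).content)  # raw image bytes
                        print('saved', filename)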

                    • reefboy1

                      
                      import bs4, requests
                      import webbrowser
                      import console

                      def get_beautiful_soup(url):
                          return bs4.BeautifulSoup(requests.get(url).text, 'html.parser')

                      # raw_input on Python 2; use input() on Python 3
                      a = raw_input('URL to check (http://www.url.com or .net, .gov, .org): ')
                      console.clear()
                      soup = get_beautiful_soup(a)
                      webbrowser.open('http://google.com/gmail')  # opens Gmail in the browser

                      print(soup.prettify())
                      
                      
                      
                      • JonB

                        @brumm
                        filename = posix.basename(urlparse.urlsplit(url)[2])

                        This might be a slightly more robust way to get the filename. Technically the URL could contain a query string, fragment, etc., which urlsplit strips out.
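
                        For illustration (not part of the original post; the example.com URL is made up), urlsplit separates the path from the query string and fragment:

                        import urlparse  # urllib.parse on Python 3

                        # made-up URL for illustration
                        url = 'http://example.com/images/logo.png?size=large#top'
                        print(urlparse.urlsplit(url)[2])  # '/images/logo.png' -- query and fragment are gone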

                        • dgelessus

                          No, bad! Don't use the posix module directly. It's an undocumented internal module that provides implementations for some of the os module's functions on Unix-based systems. In almost all cases you'll want to use os, which provides all the functions that posix does, but is guaranteed to be available on all platforms.

                          • reefboy1

                            What else can I do with bs4? I only know how to get HTML.

                            • ccc

                              So far we have demonstrated how to get:

                              • The body text only with no markup
                              • The URLs that are linked to
                              • All of the images, downloaded to disk

                              What else do web pages contain? Sounds, Movies, Forms, Lists, Files, others?

                              What else are you interested to get from web pages? Music lists? Tour schedules for your favorite band? Local weather forecast? Snow depth at local ski slopes? Wave heights at various beaches? What info does reefboy1 want to scrape off of webpages?

                              Have you gone through the bs4 documentation examples yet?

                              • reefboy1

                                Yes, I have read them, but I'm not sure I fully understand them. A weather forecast would be nice.

                                • reefboy1

                                  PS: I don't know HTML.
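
                                  As an aside (not from the thread): a forecast doesn't have to involve HTML at all. Assuming the wttr.in service is reachable (it isn't mentioned anywhere in this thread), it returns a plain-text one-liner:

                                  import requests

                                  # wttr.in serves a plain-text forecast, so no bs4 needed
                                  print(requests.get('http://wttr.in/London?format=3').text)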

                                  • wradcliffe

                                    The current examples have been great for picking out generic things on web pages, but the typical use case needs to understand the page format and then get something specific off the page, like a table of data. You typically use the pretty-print facility to view the page contents and then issue a series of tedious calls to march down through the structure and pull out the text. The existing bs4 docs and examples are fine for that.
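
                                    For concreteness (this example is not from the post; the URL and table layout are made up), that hand-coded marching might look like:

                                    import bs4, requests

                                    # made-up URL; assumes the first <table> on the page holds the data
                                    soup = bs4.BeautifulSoup(requests.get('http://example.com/data').text, 'html.parser')
                                    table = soup.find('table')
                                    for row in table.find_all('tr'):
                                        # one list of cell strings per row
                                        cells = [cell.get_text(strip=True) for cell in row.find_all(['td', 'th'])]
                                        print(cells)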

                                    You can get really far with a very specific problem and easily code up something that grabs what you need off a page using these hand-coded methods, but the code is probably going to be throwaway and will break as soon as the target site changes the format of its pages.

                                    However... this is NOT what a true "screen scraper" app does. True screen scrapers build on bs4, provide templates for various kinds of web pages, and automatically strip out all the crap like inline ads, sidebars, and other "visual cruft" while returning only the "content". The best ones can identify the main images the page is referring to and the body of text that is the main subject of the page. There is a project called "readability" that did this and was ported to Python, but it stopped being updated when it became a commercial product. You can still find the early Python code, which uses a version of bs, at: https://github.com/gfxmonk/python-readability

                                    The big players in this area are obviously the search engine companies like Google, Bing, and Facebook, which have extremely sophisticated methods for dissecting web pages and getting at the real "content". The rest of the universe seems to have moved on to using a web API instead, which just hands you the required data using a special URL syntax.

                                    • brumm

                                      @JonB: Great, thank you.

                                      • JonB

                                        @dgelessus, my bad... that should be os.path.basename instead. I wasn't sure how Windows machines would handle forward slashes in URLs, but a quick test shows it works properly.
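
                                        Put together (illustrative only; the example.com URL is made up), the corrected version would be:

                                        import os
                                        import urlparse  # urllib.parse on Python 3

                                        # made-up URL for illustration
                                        url = 'http://example.com/images/logo.png?size=large'
                                        filename = os.path.basename(urlparse.urlsplit(url).path)
                                        print(filename)  # 'logo.png' -- os.path.basename handles '/' on Windows too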
