omz:forum

    Webpage Slices are Different from what is There

    Pythonista
    • TomD
      TomD last edited by

      I downloaded a webpage successfully, and its content looks correct. But when I print slices of it, the characters are different.

      Specifically, when I print the first character it shows the first plus the next 3 characters and an apostrophe on the end. So one character becomes 5 characters.

      On printing longer slices of the webpage the number of characters is also greater and the apostrophe is always added on the end.

      What is happening?

      Tom

      • cvp
        cvp @TomD last edited by

        @TomD could you post your code here?

        • cvp
          cvp @TomD last edited by

          @TomD If you download with something like

          data = requests.get(url).content
          

          data is a bytes object, not a str. When you print it, Python shows its representation, which looks like b'xxx'.
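To make that concrete, here is a small offline sketch. The sample bytes are made up and stand in for a downloaded page; nothing is fetched. It shows why a one-byte slice can appear as five characters on screen.

```python
# Hypothetical sample standing in for a downloaded page body (bytes, not str).
data = b"\r\n\r\n<!DOCTYPE html>"

# Slicing bytes gives bytes; print() displays them via repr(),
# i.e. wrapped in b'...' with escape sequences.
first_slice = data[0:1]
shown = repr(first_slice)

print(shown)                         # b'\r'
print(len(first_slice), len(shown))  # 1 5

# Indexing (rather than slicing) a bytes object gives an integer byte value.
first_byte = data[0]
print(first_byte)                    # 13
```

So the slice really holds one byte; the extra characters are only how Python displays it.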

          • TomD
            TomD last edited by

            #The "with" statement overflows into the next line due to this narrow comment box

            import urllib.request
            with urllib.request.urlopen("https://www.asx.com.au/asx/statistics/todayAnns.do") as response:
                tda = response.read()

            #Print the entire html string so I know what is in it
            #The output of this print statement starts:
            #b'\r\n\r\n\r\n<!DOCTYPE

            print (tda)

            #Separately print the first 5 characters in the html string
            #The output of this is, including spaces between items:
            #b'\r' b'\n' b'\r' b'\n' b'\r'

            print (tda[0:1],tda[1:2],tda[2:3],tda[3:4],tda[4:5])

            #Print the first 5 characters in the string
            #The output of this is:
            #b'\r\n\r\n\r'

            print(tda[0:5])

            • TomD
              TomD last edited by

              CVP, so none of the printed slices shows the html one character at a time. Each one combines characters into groups and adds apostrophes.

              • cvp
                cvp @TomD last edited by

                @TomD You can see that the content is shown between b' and '
                Sequences starting with \ are escapes for non-printable characters, e.g. \n = newline
                Thus b'\n' is only one byte, the newline
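A quick way to check this yourself, as a minimal sketch with literal sample bytes (no download needed): compare the length of the bytes with the length of what gets displayed.

```python
# Each escape sequence is a single byte, even though it is displayed
# as a backslash followed by a letter.
newline = b"\n"
carriage_return = b"\r"

print(len(newline))          # 1
print(newline[0])            # 10, the line feed byte
print(carriage_return[0])    # 13, the carriage return byte

# A five-byte slice like the one in the question: five bytes of data,
# but 13 characters on screen once b'' and the backslashes are counted.
chunk = b"\r\n\r\n\r"
print(len(chunk), len(repr(chunk)))  # 5 13
```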

                • TomD
                  TomD last edited by

                  Thanks CVP. That has me onto something.
                  I am scraping data. Maybe I would be better off using a package like BeautifulSoup?

                  • cvp
                    cvp @TomD last edited by

                    @TomD try this

                    st = tda.decode('utf8')
                    print(st)
                    

                    And you will see that there are empty lines at the beginning, which come from the \r\n characters
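As a rough illustration of what changes after decoding, here is an offline sketch. The sample value of tda is made up to resemble the start of the page, and the name body is just for illustration.

```python
# Hypothetical sample standing in for tda (the bytes read from the response).
tda = b"\r\n\r\n\r\n<!DOCTYPE html>\r\n<html></html>"

st = tda.decode("utf8")

# Slices of a str are plain characters; there is no b'' wrapper.
print(repr(st[0:1]))   # '\r'

# lstrip() drops leading whitespace, including the \r\n blank lines.
body = st.lstrip()
print(body[0:9])       # <!DOCTYPE
```

After lstrip(), slices such as body[0:9] start at the first visible character of the page.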

                    • TomD
                      TomD last edited by

                      It doesn't like
                      print (st)

                      • cvp
                        cvp @TomD last edited by

                        @TomD try this script

                        import urllib.request
                        with urllib.request.urlopen("https://www.asx.com.au/asx/statistics/todayAnns.do") as response:
                        	tda=response.read()
                        st = tda.decode('utf8')
                        print(st)
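One caveat: decoding as utf8 assumes the page really is UTF-8. Below is a hedged sketch of a more defensive variant. The helper name decode_body and its charset parameter are made up for illustration; with urllib the declared charset could be read from response.headers.get_content_charset(), which returns None when the server declares nothing.

```python
def decode_body(raw, charset=None):
    """Decode downloaded bytes to str.

    charset is whatever encoding the server declared (hypothetical
    parameter). Falls back to UTF-8 when it is None, and replaces
    undecodable bytes instead of raising UnicodeDecodeError.
    """
    return raw.decode(charset or "utf-8", errors="replace")


# Offline samples standing in for response.read().
utf8_text = decode_body(b"caf\xc3\xa9")
latin1_text = decode_body(b"caf\xe9", "iso-8859-1")
print(utf8_text == latin1_text)   # True
```

Both samples decode to the same four-character string, even though the raw bytes differ.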
                        
                        • TomD
                          TomD last edited by

                          I see, so I could work on that decoded utf8 string more easily

                          • cvp
                            cvp @TomD last edited by

                            @TomD st contains a string (str), so yes. Good luck!

                            • TomD
                              TomD last edited by

                              Much appreciated. You have helped me get around an obstacle.

                              • mikael
                                mikael @TomD last edited by

                                @TomD, I definitely recommend using BeautifulSoup or a webview with JavaScript, the latter especially if you are trying to scrape pages with dynamic content.
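BeautifulSoup is a third-party package, so as a dependency-free sketch of the same kind of extraction, here is a version that uses the standard library's html.parser instead. The class name LinkCollector and the sample HTML are made up for illustration; the real page's structure may differ.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Made-up HTML, already decoded to str.
sample = (
    '<table><tr><td><a href="/a.pdf">First notice</a></td></tr>'
    '<tr><td><a href="/b.pdf"> Second notice </a></td></tr></table>'
)

parser = LinkCollector()
parser.feed(sample)
parser.close()
print(parser.links)
# [('/a.pdf', 'First notice'), ('/b.pdf', 'Second notice')]
```

Note that the parser is fed a str, so the downloaded bytes would need to be decoded first.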
