omz:forum




    Quick hackish html-book documentation scraper for offline reading

    Pythonista
    • paultopia
      paultopia last edited by paultopia

      You know what's annoying? When people post stuff online as html-books, like with a table of contents page and then a bunch of linked sub-pages with all the content. Lots of documentation, in particular, is organized that way (example: http://tldp.org/LDP/abs/html/ ) and it drives me nuts because I like to read stuff offline, like on my iPad on airplanes.

      Solution: a script to crawl such pages starting from the ToC and scrape unique URLs linked therein, one level deep, and append them to one big file that can then be read offline.
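A rough sketch of that one-level-deep crawl, with hypothetical function names (the actual script is in the gist below), might look like this in Python 3 syntax:

```python
import re
from urllib.request import urlopen
from urllib.parse import urljoin

def unique_links(html):
    """Collect unique .html href targets from a ToC page, preserving order."""
    seen, links = set(), []
    for href in re.findall(r'href="([^"#]+\.html?)"', html):
        if href not in seen:
            seen.add(href)
            links.append(href)
    return links

def scrape_book(toc_url):
    """Fetch the ToC, then each linked page once, and glue them together."""
    toc = urlopen(toc_url).read().decode('utf-8', 'replace')
    pages = [urlopen(urljoin(toc_url, href)).read().decode('utf-8', 'replace')
             for href in unique_links(toc)]
    return toc + ''.join(pages)  # one big (technically invalid) HTML blob
```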

      Gist: https://gist.github.com/paultopia/460acfda07f9ca7314e5

      Takes the URL of the ToC page from raw_input and deposits an HTML file in Pythonista's internal file system. From there, do with it what you will: pass it to a Dropbox upload script, pass it to Docverter or something to make it into a PDF, whatever. You can also use Pythonista's export function to get it into Dropbox or another app (like a PDF converter) the easy way, but for some odd reason the export function only works when the filename ends in .py (why is this, anyway?), so you'll have to edit the filename and then edit it back.

      Caveats: assumes all links are relative URLs on the same server, and has exactly zero validation to check that (easy to add; I just haven't bothered). It will probably crash if that assumption is violated. It also produces invalid HTML, but not in any way that will bother any browser or converter. Finally, it assumes that the documents to scrape serve their content as vanilla HTML, with no Ajax calls or the like.

      • Webmaster4o
        Webmaster4o last edited by

        The part where it doesn't check that URLs are relative does seem like an issue; without that check you've made an entire web scraper ;) This sounds really cool, though.

        • paultopia
          paultopia last edited by

          Heh, yeah, I'm about to add a little validation just for sanity-preserving purposes.

          I also just updated so it can handle ToC pages other than index.html or equivalent.

          • paultopia
            paultopia last edited by

            Improved! There's a new and much more effective version of the script that:

            1. Confirms links come from same domain
             2. Handles URLs relative to the site root, rather than to the ToC folder.
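             Those two fixes can be sketched with urllib.parse (Python 3 names; hypothetical helpers, not necessarily how the gist does it). urljoin resolves both folder-relative and root-relative hrefs against the ToC URL, and comparing netloc confirms the result stays on the same domain:

```python
from urllib.parse import urljoin, urlparse

def resolve(toc_url, href):
    """urljoin handles both folder-relative and root-relative hrefs."""
    return urljoin(toc_url, href)

def same_domain(toc_url, href):
    """True if href resolves to the same host as the ToC page."""
    return urlparse(resolve(toc_url, href)).netloc == urlparse(toc_url).netloc
```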

            Gist: https://gist.github.com/paultopia/02ca124a111a70faf174

            • ccc
              ccc last edited by

              @paultopia would you be willing to make this a repo instead of a gist? I have some pull requests in mind.

              • paultopia
                paultopia last edited by

                Absolutely @ccc -- here's a full-fledged repo: https://github.com/paultopia/spideyscrape

                PRs welcome!

                Also, I've refactored a little to make the code a bit more modular, and also to produce technically valid html.

                • paultopia
                  paultopia last edited by

                  FYI, I've tossed up a quick Python 3-compatible version in a gist for the latest version of Pythonista. (Next steps: a proper repo with a version that can handle 2 or 3, plus hopefully/maybe/one day a way to grab images and include them in the resulting HTML.)

                  https://gist.github.com/paultopia/39cb21e080b4abe24de8056e92a40ed2
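                  For the 2-or-3 version, one common pattern (an assumption about the approach, not what the gist actually does) is to try the Python 3 imports first and fall back to the Python 2 names, since urllib was reorganized between the two:

```python
# Try the Python 3 module names first, fall back to the Python 2 equivalents.
try:
    from urllib.request import urlopen          # Python 3
    from urllib.parse import urljoin, urlparse
    raw_input = input                           # Py2's raw_input is Py3's input
except ImportError:
    from urllib2 import urlopen                 # Python 2
    from urlparse import urljoin, urlparse
```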

                  • MartinPacker
                    MartinPacker last edited by

                    Shouldn't it recurse to the boundary of the domain?

                    i.e., relative links are in, as are absolute links to the same domain name, but links outside of the domain aren't.

                    Not having examined the code, I don't know whether you break cycles, either.
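                    If the script were extended to recurse that way, the usual shape is a queue plus a visited set, stopping at the domain boundary; the visited set is also what breaks cycles (page A links to B, B links back to A). A generic sketch, not spideyscrape's actual code:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl_order(start_url, get_links):
    """Breadth-first crawl confined to start_url's domain.

    get_links(url) -> iterable of hrefs found on that page.
    The `seen` set both enforces visit-once and breaks link cycles.
    """
    domain = urlparse(start_url).netloc
    seen, queue, order = {start_url}, deque([start_url]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        for href in get_links(url):
            full = urljoin(url, href)
            if full not in seen and urlparse(full).netloc == domain:
                seen.add(full)
                queue.append(full)
    return order
```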
