Article extraction and lxml-related question
wradcliffe last edited by
I was just looking at various projects that do article extraction in Python and wondering what the various modules in Pythonista are intended to support. I keep running into projects that either require lxml or say they run better with it. That library contains a lot of C code, so it has not yet been included in Pythonista, which seems to rule out most of these projects. I know that we have bs4 and can maybe install NLTK, which together could form the basis of a decent article extraction system. I can also see that there are other modules in Pythonista that can do HTML parsing or conversion to text. There is also the option of using Readability and handing the whole extraction process over to a web service.
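To make the "no lxml" route concrete, here is a minimal sketch of a pure-Python extractor built only on the standard library's html.parser. It is a crude heuristic rather than a real article extraction algorithm: it keeps the text of <p> elements, ignores <script> and <style>, and drops very short paragraphs on the assumption that they are navigation or boilerplate. The names ParagraphExtractor, extract_article_text and the min_words threshold are made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of <p> elements, ignoring <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._current = None   # text chunks of the <p> being read, or None
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "p":
            self._flush()      # an unclosed <p> ends when the next one starts
            self._current = []

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._current is not None and not self._skip_depth:
            self._current.append(data)

    def _flush(self):
        if self._current is not None:
            # Join the chunks and collapse runs of whitespace.
            text = " ".join("".join(self._current).split())
            if text:
                self.paragraphs.append(text)
            self._current = None

    def close(self):
        super().close()
        self._flush()


def extract_article_text(html, min_words=5):
    """Return the paragraphs with at least min_words words, blank-line separated.

    min_words is an arbitrary cutoff (an assumption) for dropping menus,
    bylines and similar short fragments.
    """
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    kept = [p for p in parser.paragraphs if len(p.split()) >= min_words]
    return "\n\n".join(kept)
```

Calling extract_article_text(page_source) on a downloaded page returns the surviving paragraphs as plain text. The same idea could be written more compactly with bs4 and its built-in "html.parser" backend, which also avoids lxml.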
One of the reasons this may not be a big priority in Pythonista is that most uses of this tech are for fairly large-scale web crawling of lots of sites, which obviously makes no sense on an iPad or iPhone. I am interested in it more for automating manual copy-and-paste workflows involving a few dozen to at most 100 websites.
Also, there may be a better way to "screen scrape" using Pythonista that is not obvious. If it were possible to just open the page in Safari, do something like a double touch on the center of the screen, and then screen capture and OCR the result to get the text, you could avoid all of the parsing nonsense altogether. After all, that's what the original term "screen scraping" meant.