omz:forum

    Trying to read a *.pdf file

    Pythonista
    • DavinE
      DavinE last edited by DavinE

      Hello guys,

      I have a question/problem.
      I'm trying to read a PDF file and get output like this:

      ['2374575', '1', 'ZB33S Zählerschr.,univ. Z,1100x800x205mm,SKII', 'ZB33S', 'Ausgeliefert', '272.89', '0']
      ['2081086', '1', 'FZ443N Kabelrangierkanal,3-feldig,aufsteckbar', 'FZ443N', 'Ausgeliefert', '50.49', '0']
      

      I only get the full text as one block, and I don't know how to separate it.
      In the example file I only need Pos. 1-6.

      This is the output I get:

      Auftrags-/RechnungsauskunftDatum Von 14.09.2021Datum Bis 14.11.2021Artikelnummer Artikel-Volltextsuche Auftragsnummer Bestellnummer AufträgeNummerDatumStatusObjektKommissionBestellangabenBestell-Nr.4919825 / 113.10.2021erfasst95463035 Bauer, ThorstenSeite 1 von 1PosAusschr.-PosArtikel-Nr.Verband-Nr.BezeichnungBestelltLieferbarRückstandEinzelpreisGesamtpreis128931232893123DEHN Kombi-Ableiter Typ1+2 DEHNsh 909340 ZP
      Basic 2 für TN-S-Systeme110181,95181,95228400782840078HAGE vector AP-Kleinverteiler    VE312DN
      IP65,3-reihig, 36TE, Rangierkanal110118,53118,53335072223507222HAGE Leitungsschutzschalter C16A  MCN316 6kA
      3-polig 230/400V C-Charakter.11025,3025,30435070233507023HAGE FI-Schutzschalter 40A       CDS440D
      QuickConnect Typ A 4-pol. 30mA 400V10130,9030,90535070923507092HAGE Gabel-Phasenschiene 10qmm   KDN363F
      QuickConnect 3-polig+N 12M1105,255,25629661202966120HAGE Sammelschiene 1feldrig        ZM11C
      universN L=245mm 12x5mm 1 Stück1103,653,65729661212966121HAGE Sammelschiene 2feldrig        ZM12C
      universN L=495mm 12x5mm 1 Stück1107,307,30Gesamtwert: 372,88Seite 1 von 1 
      

      Here is an example file: https://imgur.com/a/RsMnG4A

      This is my code:

      
      import PyPDF2

      # Open the PDF in binary mode; the file must stay open while pages are read.
      with open('AB_NEU.pdf', 'rb') as f:
          reader = PyPDF2.PdfFileReader(f)
          page = reader.getPage(0)   # first page only
          text = page.extractText()  # returns the whole page as one string
          print(text)
      
      

      I would be glad to receive any help.

      • cvp
        cvp @DavinE last edited by

        @DavinE Unfortunately, the documentation for PageObject.extractText() says:
        "Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used."

        • DavinE
          DavinE @cvp last edited by

          @cvp

          Yes, I have also seen that it unfortunately does not work the way I would like. I will now work something out for the one manufacturer that does not provide me with CSV files, so that I can read the PDF.

          Thank you anyway.

          And I wish you a Happy New Year xD

          • JonB
            JonB last edited by

            You could also try using PDFKit in objc_util. If your documents are always the same format, you can select text from specific regions, which might fix problems with text being out of order. You can also return all text in the page.
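To illustrate the region idea, here is a minimal sketch in plain Python rather than actual PDFKit calls. It assumes the text has already been obtained as positioned fragments; the `Fragment` type, the `text_in_region` helper, and all coordinates are hypothetical, and `y` is measured from the top of the page purely for readability.

```python
from typing import List, NamedTuple, Tuple


class Fragment(NamedTuple):
    """A piece of text plus the position of its top-left corner (hypothetical)."""
    x: float
    y: float  # assumed to be measured from the top of the page
    text: str


def text_in_region(fragments: List[Fragment],
                   region: Tuple[float, float, float, float]) -> str:
    """Return the text inside region (left, top, right, bottom) in reading order."""
    left, top, right, bottom = region
    inside = [f for f in fragments
              if left <= f.x <= right and top <= f.y <= bottom]
    # Sort top-to-bottom, then left-to-right, so the order in which the
    # fragments were stored in the file no longer matters.
    inside.sort(key=lambda f: (f.y, f.x))
    return " ".join(f.text for f in inside)


# Fragments deliberately listed out of order, as a content stream might store them.
fragments = [
    Fragment(300, 120, "181,95"),
    Fragment(40, 120, "2893123"),
    Fragment(40, 20, "Seite 1 von 1"),
    Fragment(100, 120, "DEHN Kombi-Ableiter"),
]
row = text_in_region(fragments, (0, 100, 400, 140))
print(row)
```

If every document from the same supplier uses the same layout, one fixed region per column or per table would probably be enough.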

            • JonB
              JonB last edited by

              Actually, you should check out pdfminer, which has ways of detecting tabular data based on bounding boxes. Camelot and tabula may also work.
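As a rough sketch of what "tabular data based on bounding boxes" can mean: fragments whose vertical positions nearly match are treated as one table row, and each row is then ordered left to right. The code below is not pdfminer itself. It works on hand-written `(x0, y0, text)` tuples with made-up coordinates, and the `group_rows` helper and its `y_tolerance` parameter are hypothetical. With pdfminer.six, similar positions should be available from the layout objects it produces, but that step is left out here.

```python
from typing import List, Tuple

# (x0, y0, text): left edge, vertical position, content. Coordinates are invented.
Fragment = Tuple[float, float, str]


def group_rows(fragments: List[Fragment], y_tolerance: float = 2.0) -> List[List[str]]:
    """Group fragments into rows by similar y0, then order each row by x0."""
    rows: List[Tuple[float, List[Fragment]]] = []
    # PDF coordinates grow upward, so a larger y0 means higher on the page.
    for frag in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - frag[1]) <= y_tolerance:
            rows[-1][1].append(frag)         # same row as the previous fragment
        else:
            rows.append((frag[1], [frag]))   # start a new row
    return [[text for _, _, text in sorted(cells, key=lambda f: f[0])]
            for _, cells in rows]


fragments = [
    (300.0, 700.5, "181,95"),
    (40.0, 700.0, "2893123"),
    (100.0, 700.0, "DEHN Kombi-Ableiter"),
    (40.0, 680.0, "2840078"),
    (300.0, 679.5, "118,53"),
    (100.0, 680.0, "HAGE vector AP-Kleinverteiler"),
]
table = group_rows(fragments)
for row in table:
    print(row)
```

The tolerance is there because text in the same visual row is not always placed at exactly the same height; a suitable value would have to be found by trying it on the real file.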

              • DavinE
                DavinE last edited by

                I think this is beyond me and my current knowledge.
                But thanks for your help.
