synOCR - GUI for OCRmyPDF

geimist · 4 Mar 2019

Hello koen,

in general, you set the language with the -l parameter.
If the language you want is not included, you would, as you suspected, have to use the Docker image ocrmypdf-polyglot. In the file "/volume1/@appstore/synOCR/etc/Konfiguration.txt" you need to set the variable "dockercontainer" accordingly. You can also do this with the following command (for example in the Task Scheduler), run as root:

Rich (BBCode):

synosetkeyvalue "/volume1/@appstore/synOCR/etc/Konfiguration.txt" dockercontainer "jbarlow83/ocrmypdf-polyglot"

I am aware that there is demand for this, and an implementation via the GUI is already on the to-do list.
I have not run any tests in this direction yet!

koen · 5 Mar 2019

It works now, many thanks ("bedankt"), geimist!

I now have one question, one solution and one attempt, which I will post as three separate messages. I hope it is not a problem if I continue in English; I notice that writing in German is quite hard for me (I have to log in again every time because it takes so long).

Question:
Is there a way to include a text file with OCR tags? Maybe by entering some referring code in the form field of the GUI, or perhaps with a similar synosetkeyvalue command (I really liked that solution to my first question).
It should be included for each new run of synOCR-start.

Subquestion: Will searching for more tags have a lot of impact on the speed of the whole process? My guess is that most of the time is spent starting the Docker container, not so much on the actual OCR and tagging/renaming.

geimist · 5 Mar 2019

koen wrote:
Question:
Is there a way to include a text file with OCR tags? Maybe by entering some referring code in the form field of the GUI, or perhaps with a similar synosetkeyvalue command (I really liked that solution to my first question).
It should be included for each new run of synOCR-start …

I don't like the way tags are currently handled in the GUI.
Your idea with a separate text file (e.g. in the INPUT folder) would be a good alternative if I can't think of anything better for the GUI.

Subquestion:
Will searching for more tags have a lot of impact on the speed of the whole process? My guess is that most of the time is spent starting the Docker container, not so much on the actual OCR and tagging/renaming.

It is as you say: the ocrmypdf Docker container takes up most of the time. The tag processing is negligible.

koen · 5 Mar 2019

Thanks for your quick replies! It's no problem for me to read German if that would help other users in this forum.

Here is the contribution (Lösung) I promised:

Instead of the scheduler (Zeitplaner), I use a script that runs synOCR whenever a file is changed (created by the scanner) in the source folder. It uses the watchmedo program as described in the OCRmyPDF documentation.

It requires some skill, since you need to install watchmedo on the command line via an SSH login (at least I don't know how to do it in the DSM web interface), following the procedure below.

Please note that you use this entirely at your own risk, since I'm not a programmer and I don't have any other experience with Linux.
If there are any experts around, please let me know if you see any risks in using these scripts.
Also, do not expect much support from me when trying this yourself, but of course I will reply in this forum if I can think of anything useful.
I figured this out by just googling a lot, trying a lot, and googling again for my errors.

-pip must be installed; possibly this is already the case if Python is installed (otherwise see http://pip.pypa.io/en/stable/installing/)
Make sure you stick to one version of pip; I remember having issues from mixing up pip and pip3. I guess both are okay, as long as you use the same one for every command.

-install watchmedo: http://github.com/gorakhargosh/watchdog#installation

Place a file like this on your NAS:

Rich (BBCode):

#!/bin/sh

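# Watch the scan folder and run ocrmanage.sh whenever a PDF changes.
# --drop ignores new events while the previous command is still running.
# /volume1/myscripts is only an example location; adjust it if your scripts live elsewhere.
watchmedo shell-command \
    --patterns='*.pdf' \
    --command='echo "starting ocrmanage.sh"; /volume1/myscripts/ocrmanage.sh' \
    --drop \
    /volume1/your/scan/folder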

Don't forget to change your/scan/folder to the folder you scan your PDFs to!
You can place this script in any folder you like; it doesn't have to be the folder with the PDFs. Perhaps it's better to use a folder like /volume1/myscripts for it. You can name the file whatever you want; I use "watchfolder.sh".

Then place a second script in the same folder as the first script and call it "ocrmanage.sh":

Rich (BBCode):

#!/bin/bash

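# Folder the scanner writes to; must be the folder watched by the first script.
SCAN_DIR="/volume1/documenten/administratie/scans/preocr"

echo "ocrmanage has started"
# compgen -G succeeds as long as at least one file matches the pattern.
# (A quoted "*.pdf" inside [ -e ... ] is never expanded, so that test always fails.)
while compgen -G "$SCAN_DIR/*.pdf" > /dev/null; do

	echo "synOCR is started"
	/usr/syno/synoman/webman/3rdparty/synOCR/synOCR-start.sh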
	echo "synOCR is done"

done
echo "ocrmanage is done"

Both scripts must be made executable (for example with chmod 755).

Then, via DSM Control Panel > Task Scheduler, create a task that starts the first script on every startup of your NAS.
Then run the task once manually (so you don't need to restart your NAS).
The first script will start watchmedo.
Watchmedo will recognize new PDF uploads in the specified folder and start the second script. You can continue to add new scans.
Meanwhile the second script will start synOCR for all PDFs in the folder. After synOCR has finished, it will rescan the folder for PDFs that were put there while synOCR was busy, and it will restart synOCR until the folder stays empty.

It should be running fine now for a single folder.

I think you cannot use it for two folders at once; you will get an error that synOCR is already running.
This could possibly be solved:
-by adding more watchmedo tasks in the first script
-and in the second script:
checking whether synOCR is already running before running it again,
and restarting synOCR as long as PDFs exist in any of the folders to be watched.
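
The two checks for the second script could be sketched roughly as follows. This is only a sketch under assumptions: the helper names any_pdfs and run_locked, the lock directory and the example folders are hypothetical and not part of synOCR, and synOCR itself would still have to be configured to process each of those folders.

```shell
# Sketch only: helper names and paths are made up for illustration.

# any_pdfs DIR...: succeed if at least one PDF exists in any given folder.
any_pdfs() {
    for dir in "$@"; do
        for f in "$dir"/*.pdf; do
            # An unmatched pattern stays literal, so test for a real file.
            if [ -e "$f" ]; then
                return 0
            fi
        done
    done
    return 1
}

# run_locked LOCK_DIR COMMAND...: run COMMAND unless another instance
# already holds the lock. mkdir is atomic, so it can serve as a simple lock.
run_locked() {
    lock_dir=$1
    shift
    if mkdir "$lock_dir" 2>/dev/null; then
        "$@"
        status=$?
        rmdir "$lock_dir"
        return "$status"
    fi
    echo "already running, skipping" >&2
    return 75
}

# Possible use in ocrmanage.sh (folders are examples):
# while any_pdfs /volume1/scans/a /volume1/scans/b; do
#     run_locked /tmp/ocrmanage.lock \
#         /usr/syno/synoman/webman/3rdparty/synOCR/synOCR-start.sh || break
# done
```

The loop stops as soon as the lock is held by another run or synOCR reports an error, which also avoids spinning forever if a PDF cannot be processed and stays in the folder.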

geimist · 5 Mar 2019

Thank you very much for your input.

This is certainly a useful solution for anyone who wants to set up folder monitoring themselves. In synOCR I always prefer to avoid installing additional software on the DS (where possible and necessary, I use statically linked external programs that run out of the box). I may look into optionally integrating the Docker container 'watchdog' for folder monitoring at some point in the future.

Until then, it is also no problem at all to have the DSM Task Scheduler start synOCR every minute. If there is nothing to do, no Docker container is started and synOCR exits again right away.

koen · 5 Mar 2019

I also announced a feature request:

I would really like the option to scan multiple documents in one go using separator sheets. I would print multiple sheets containing a text marker and stick these between my paper documents (I use a distinctive colour and slightly thicker paper so I can easily remove and re-use them).

I am thinking of a script that uses pdftk to search for the separators (starting from the back of the document is best, I think) and then chops the document in two.
This can be repeated for the first part until no more separator markers are found.

Of course this script can be used separately from synOCR,
BUT:

Separator (text) recognition is only possible after OCR, but chopping the PDFs is necessary before tagging and renaming.

So it really needs to be part of the synOCR scripts, unless somebody knows another way to recognize and chop PDFs without OCR?

Would it be possible to include a similar script in the synOCR workflow?

geimist · 5 Mar 2019

Do I understand you correctly:
You have several documents that you combine into one stack using separator sheets so that you can scan them in one go, and afterwards the software should split them up again?

If someone contributes the implementation, I will gladly include it. If I put it on my own to-do list, I don't know whether or when I will be able to implement it.

The process is as follows:
1. ocrmypdf adds a text layer to the PDF
2. pdftotext creates a text file of the first page (alternatively, all pages could be output, but that encourages false positives)
3. the text file is searched for the date and tags

So your workflow would have to be inserted between steps 2 and 3.
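
Step 3 of the process above can be illustrated with a small shell function. This is a sketch and not synOCR's actual code: find_tags and the file names are made up for illustration, and it assumes the first page has already been written to a text file, for example with pdftotext -f 1 -l 1.

```shell
# Sketch only: find_tags is a hypothetical helper, not part of synOCR.

# find_tags TEXTFILE TAG...: print every tag that occurs in the text file.
find_tags() {
    text_file=$1
    shift
    for tag in "$@"; do
        # -q: no output, -i: ignore case, -F: literal string instead of regex
        if grep -qiF -- "$tag" "$text_file"; then
            printf '%s\n' "$tag"
        fi
    done
}

# Possible use (file names are examples):
# pdftotext -f 1 -l 1 scan_ocr.pdf page1.txt
# find_tags page1.txt invoice insurance garage
```

Because only one small text file is searched per document, adding more tags costs very little compared with the OCR run itself.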

koen · 5 Mar 2019

geimist wrote:
In synOCR I always prefer ... run out of the box ... no problem to have the DSM Task Scheduler start synOCR every minute ...

In the GUI, only hours can be selected as the frequency for the scheduler.

I tried watchmedo so that I could make further changes to the PDFs, such as separation and so on.

geimist · 5 Mar 2019

Yes, the scheduler in synOCR only supports hourly intervals, but in DSM under Control Panel > Task Scheduler you can go down to minutes.

koen · 5 Mar 2019

That is indeed the desired setup.

I don't know the exact workflow inside your program, but I guess the cutting should be done after the complete OCR and before indexing the first page. Otherwise the smaller documents don't have any data to search for tags.

koen · 6 Mar 2019

Okay, now that I can use synOCR, I started my paperless home office today, and I really want to share one of my findings (in this forum, since synOCR is important in my setup):

At first I was very disappointed with the limited functionality of Synology Drive: very few search options for the labels in use and, most importantly, no preview window or even thumbnails for PDFs!
Since I don't need the extensive functions of ecoDMS or Mayan EDMS, I came across TagSpaces, which really suits my needs: an immediate preview of all scanned PDFs, and then you just drag the appropriate tags onto the file.
This is really great, especially since I can now use the not-so-useful watchmedo script to strip the # from the filenames that were created by synOCR!
In Konfiguration I set up synOCR to place tags between brackets [ ... ], matching the TagSpaces filename format. A script then takes out the synOCR "#" characters and voilà, the tags are automatically imported into TagSpaces...
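
A minimal sketch of such a clean-up script is shown below. The function name strip_hashes and the example filename are assumptions for illustration; the exact names synOCR produces depend on the configuration.

```shell
# Sketch only: strip_hashes is a hypothetical helper.

# strip_hashes DIR: rename every PDF in DIR whose name contains "#",
# removing the "#" characters. Existing files are never overwritten.
strip_hashes() {
    dir=$1
    for path in "$dir"/*.pdf; do
        # An unmatched pattern stays literal, so skip it.
        if [ ! -e "$path" ]; then
            continue
        fi
        name=${path##*/}
        new_name=$(printf '%s' "$name" | tr -d '#')
        if [ "$name" = "$new_name" ]; then
            continue
        fi
        if [ -e "$dir/$new_name" ]; then
            echo "skipping $name: $new_name already exists" >&2
            continue
        fi
        mv -- "$path" "$dir/$new_name"
    done
}

# Possible use (folder is an example):
# strip_hashes /volume1/your/output/folder
# "2019-03-01 [#invoice #garage] scan.pdf" -> "2019-03-01 [invoice garage] scan.pdf"
```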

whocares · 6 Mar 2019

It would be cool to scan several documents in one go with "separator sheets" in between. On the NAS, some logic would then save this one file as individual files (depending on the separator sheets).

geimist · 6 Mar 2019

That matches koen's suggestion.
It won't be implemented right away, though, since I first have to see which software I can use for it (if anyone has a tip, let me know). The next bigger thing I want to implement is profiles. For that, however, I have to switch from the configuration file to a database, which in turn means additional effort.

koen · 6 Mar 2019

Indeed, I would like that option very much.

There are several tools like pdfgrep that can search for text strings (e.g. a marker printed on your separator sheet) and return the page numbers of the hits. You could then loop over these numbers and cut one end off the file with pdftk each time.

For the easiest execution, I think either the loop or the search should run from back to front (otherwise the page numbers would have to be recalculated after every cut). I really hope someone can implement this. Since I don't have any education in programming, I will not try it myself; my coding would be very crappy. I will be glad to help and will look around for a useful tool, and I will return with suggestions.
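
As a sketch of the page arithmetic: instead of cutting repeatedly from the back, all page ranges can be computed up front against the original file, so nothing has to be renumbered. The function name split_ranges is made up for illustration, and it assumes the separator page numbers are already known (for example from pdfgrep -n) and passed in ascending order.

```shell
# Sketch only: split_ranges is a hypothetical helper.

# split_ranges TOTAL_PAGES SEP_PAGE...: print one "first-last" page range
# per document, leaving out the separator pages themselves.
# Separator pages must be given in ascending order.
split_ranges() {
    total=$1
    shift
    start=1
    for sep in "$@"; do
        # Only emit a range if there are pages before this separator.
        if [ "$sep" -gt "$start" ]; then
            echo "$start-$((sep - 1))"
        fi
        start=$((sep + 1))
    done
    # Pages after the last separator form the final document.
    if [ "$start" -le "$total" ]; then
        echo "$start-$total"
    fi
}

# Possible use with pdftk (file names are examples):
# n=0
# for range in $(split_ranges 10 4 8); do
#     n=$((n + 1))
#     pdftk scan.pdf cat "$range" output "doc_$n.pdf"
# done
```

For a 10-page scan with separators on pages 4 and 8, this yields the ranges 1-3, 5-7 and 9-10.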

I also suggested a text file to configure the tags for synOCR to look for. Maybe it could even help with importing the tags from TagSpaces.

One question about tags in synOCR: I couldn't figure out whether it is possible to assign a tag that differs from the exact search string. I like short tags for use with TagSpaces, but I don't like false positives, so search strings should be long. In the synOCR tag input field I tried "rolls royce=rr " for my garage bills, but that doesn't work... I didn't really understand the German help text for this field. Does the "=" option only work when I use subfolders for storage?

koen · 6 Mar 2019

https://askubuntu.com/questions/454934/how-can-i-extract-pages-containing-a-given-string-from-a-pdf-file

This does the exact opposite of what I'd like, but it contains all the right tools! I might dive into it later...

geimist · 6 Mar 2019

pdftk seems to be a good way, but unfortunately it is not available on the DS. So either someone finds (or compiles) a statically linked build for x86_64, or we have to see whether it can be done with an additional Docker container.

There is also pdfseparate (part of poppler, a fork of xpdf). But I have not found a statically linked version of that either.

koen · 8 Mar 2019

That's a shame, pdftk looked promising. I don't really know what you mean by static linking; that goes beyond my Unix or programming knowledge ...

Maybe, instead of splitting a document, there is the option of just PDF-printing the pages you need? I found this on a forum:

A better tool for the job is Ghostscript, which you probably already have installed:
$ gs -dNOPAUSE -dBATCH -dFirstPage=2 -dLastPage=2 -sDEVICE=pdfwrite -sOutputFile=dest.pdf -f src.pdf
This passes the PDF data through unchanged, since Ghostscript understands PDF (a PostScript derivative) to a much deeper level than ImageMagick does.

Other options may be:
http://pypi.org/project/PyPDF2/
or
http://pypi.org/project/pdfsplit/ (last update 2008..)
Can Python scripts/programs be used without an extra Docker container?

(And can you maybe help me out with the other question in my last post, about assigning tag "abc" for search string "tuvwxyz"?)

geimist · 8 Mar 2019

koen wrote:
… I don't really know what you mean about static linking, …

When a program has been statically linked and built, it usually consists of a single file. All system dependencies (e.g. libraries) are included. This file can then usually be run on any (compatible, in this case x86_64 Linux) device. An individual installation is then not necessary. And I'm sure you can see the advantage here.

I will look at the other links later. At the moment I don't have the time needed for this feature, but maybe later …

(And can you maybe help me out with the other question in my last post, about assigning tag "abc" for search string "tuvwxyz"?)

The way you want it, this does not currently work.
Tags can only be assigned to categories (folders for sorting into), but tags go into the file name exactly as they are. Maybe I can put that on the to-do list as well.

koen · 10 Mar 2019

As far as I understand, Ghostscript needs a separate installation via ipkg, so no static linking.

But I think it's still very interesting: Ghostscript has an "inkcov" device that calculates the C, M, Y and K percentages per page. That could be really useful! Since I wanted to use coloured separator sheets, this means:
-with Ghostscript it should be possible to detect the coloured separator pages and also to print the individual documents, all with one single tool
-and it would also be possible to run this process without invoking tesseract first, so there is no need to rethink the synOCR structure
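
A rough sketch of the detection step follows. It assumes that the inkcov device prints one line per page with four coverage fractions followed by "CMYK OK"; please check this against the output of your own Ghostscript version. The function name colour_pages and the default threshold of 0.2 are made up for illustration.

```shell
# Sketch only: colour_pages is a hypothetical helper, and the assumed
# inkcov line format is "C M Y K CMYK OK" with one line per page.

# colour_pages [THRESHOLD]: read inkcov output on stdin and print the
# numbers of all pages whose C, M or Y coverage exceeds THRESHOLD.
colour_pages() {
    awk -v limit="${1:-0.2}" '
        $5 == "CMYK" {
            page++
            # "+ 0" forces a numeric comparison.
            if ($1 + 0 > limit + 0 || $2 + 0 > limit + 0 || $3 + 0 > limit + 0)
                print page
        }'
}

# Possible use (file name is an example):
# gs -q -o - -sDEVICE=inkcov scan.pdf | colour_pages 0.5
```

The threshold has to be tuned to the paper: a fully coloured separator sheet should show far higher C, M or Y coverage than a normal page with a small coloured logo.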

Neodys · 11 Mar 2019

Hello Stephan,

I just saw that the new version already includes my request to place files in the output folder only once they are final. That was quick. Many thanks for your effort.

Best regards