When it comes to digitalization, I am a great advocate for the paperless office and for managing my paperwork without any paper at all. But because public authorities, insurance companies and other institutions still send me letters and printed contracts, there is not much I can do other than scan that stuff manually.
Normally I’d use an old Canon flatbed scanner, with the obvious drawbacks that it takes forever to scan a single page and that it requires me to have access to the scanner, ergo I can only scan at home. After some research I found that there is quite a list of solutions for that, such as fileee, scanbot (formerly Doo), CamScanner or Tiny Scanner. All of those will do the job, but they either lack functionality, are sold as part of a freemium model, require a login, or need way too many clicks to achieve what I want. The tools are also not open source. This is not a problem per se, but from the beginning I was under the impression that it would only require orchestrating some already available open source tools, a little glue code and a server to build something that
- is tailor-made for my use case and thus requires as few clicks as possible
- consists of open source components only, so anybody can adapt it to their needs
- does not transfer my documents to 3rd party clouds
- does not require new hardware
So after some hacking I came up with scan2mail, a 100% organic, free-range scanner toolchain for lazy people. You can find the source code on GitHub.
The scan2mail toolchain
The toolchain that I propose is a tiny server application that can be self-hosted by anyone with access to a server and some DevOps tools in her or his belt.
Step 1: Digitalizing and uploading the document
The first step is fairly simple. As the client device I use my smartphone to take pictures of the documents I want to scan. The toolchain is forgiving towards slight rotations, limited light and imperfect cropping. I only have to take pictures of all the pages I want to scan and make sure that they have roughly the correct rotation and no overly hard shadows on them.
On the phone you can use any application to take and edit pictures. To upload documents to a server I use Nextcloud, a fork of Owncloud, which is an open source, self-hosted file sharing platform and thus a free alternative to Dropbox or Google Drive. Nextcloud and Owncloud both have open-source Android apps to push the images to your server instance.
Step 2: Pulling the files to the server
Now that I have uploaded my pictures to the server, the server-side toolchain comes into play. I use a Python script to hold it all together. The first processing step here is to download the files from my Nextcloud using the official command-line client
owncloudcmd. Notice that this is a client for Owncloud, not Nextcloud. I found most of the tools to be interoperable. The synchronization is a mere call like
owncloudcmd -s -h /home/basti/working_dir/ https://bastitee:email@example.com/remote.php/webdav/Scan2Mail
with my local working directory
/home/basti/working_dir/ and a remote folder
Scan2Mail where I uploaded the images. To make sure that all new images of a batch are available, the actual function repeats this step until no new files are found. I found this to work quite well, even though it is really not a well-designed solution. I would love to know if you have a better idea or solution.
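The repeat-until-stable idea can be sketched in a few lines of Python. This is only a minimal sketch and not the code from the repository: the function names, the pause between passes and the max_rounds safety limit are assumptions made for illustration. The owncloudcmd call uses exactly the flags shown above.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable, Set


def list_files(folder: Path) -> Set[str]:
    """Return the relative paths of all regular files below `folder`."""
    return {str(p.relative_to(folder)) for p in folder.rglob("*") if p.is_file()}


def run_owncloudcmd(local_dir: str, remote_url: str) -> None:
    """Run one sync pass with the same flags as in the call shown above."""
    subprocess.run(["owncloudcmd", "-s", "-h", local_dir, remote_url], check=True)


def sync_until_stable(sync: Callable[[], None], folder: Path,
                      pause: float = 5.0, max_rounds: int = 10) -> Set[str]:
    """Repeat `sync` until one pass adds no new files.

    `pause` and `max_rounds` are hypothetical knobs: the first gives a
    running upload some time to finish, the second keeps the loop finite.
    """
    seen = list_files(folder)
    for _ in range(max_rounds):
        sync()
        current = list_files(folder)
        if current == seen:  # this pass brought nothing new, so we are done
            break
        seen = current
        time.sleep(pause)
    return seen
```

A call could look like sync_until_stable(lambda: run_owncloudcmd(local_dir, remote_url), Path(local_dir)). Passing the sync step in as a function keeps the loop easy to try out without a server.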
I haven’t tried to run this toolchain with non-free services like Dropbox. I know that Dropbox offers the required command-line tools for the server side, so it should be possible to use it as well. For Google Drive or Apple iCloud I am unsure, but I assume they would require more preparation, because these services authorize via API keys, tokens, OAuth or other complex authorization methods.
Step 3: Image processing
I should probably start this section with yet another praise for all those awesome, free and open-source tools available online. But to keep it short: My assumptions were correct. Scanner preprocessing is a solved problem and it was ridiculously easy to set up. Everything I needed was 100% open-source code. The toolchain I use consists of
- Apply image rotation from EXIF-data and strip metadata entirely using
convert (part of ImageMagick)
convert infile.jpg -auto-orient -strip outfile.jpg
- Post-process basic image with deskewing, removing borders, normalizing illumination, despeckling and more using
scantailor-cli -l=1 infile.jpg outfolder/
Notice that I use
scantailor-cli with close to no configuration at all. The
-l=1 only forces scantailor to output a one-page layout instead of auto-splitting pages on visual breaks. The default values work perfectly for portrait-oriented letters, contracts etc., but they throw away all color information. I don’t need color, but if you do,
scantailor-cli is the tool to be configured.
- Converting TIF to single-page PDF using
tiff2pdf (part of libtiff)
tiff2pdf -p A4 -F -o outfile.pdf infile.tif
-p A4 and
-F set the paper size to A4 and make the TIFF fill that page, in case the scan and the border removal during the
scantailor-cli step yielded an image with different dimensions. The default compression yields documents from around 0.5 to a few megabytes in size (depending on the number of pages) and required no additional configuration.
- Create batches of single pages by comparing creation dates using a Python function
That is the major part of the script that I wrote. I try to extract the creation date of the image from the filename first (Android’s default filename pattern for images is
IMG_20170312_133700.jpg) and, if not found, then by the EXIF
date:create datum using
identify (part of ImageMagick).
identify -verbose infile.jpg | grep "date:create" ...
It should be easy to extend the filename extraction to other common filename patterns. I found that the EXIF fallback sometimes yields weird dates that mix up the resulting documents. This seems to be a common issue for tools operating on images: they always seem to mess a little with the EXIF data, especially the
date: fields. (As far as I can tell, ImageMagick fills date:create from the file's timestamp rather than from the capture time stored in EXIF, which would explain why a copy or sync can shift it.) As with my personal photo collection, I therefore prefer filename-based creation dates; that is a performance advantage anyway.
Batching follows the KISS principle to the letter. Given a list of images, one batch contains all images whose creation times differ by one minute or less. Yes, it’s simple and maybe underdeveloped, but it also means I don’t have to interactively tell the process where a multi-page document starts and where it ends. I only have to wait 60 seconds before starting a new document, which proved to be sufficiently reliable during my tests.
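To make the batching rule concrete, here is a minimal sketch of how it could look in Python. It is not the code from the repository. The function names are made up for illustration, only the filename-based date is implemented (the fallback via identify is left out), and I assume that the one-minute rule is applied to consecutive images, so a batch keeps growing as long as each picture follows the previous one within 60 seconds.

```python
import re
from datetime import datetime, timedelta
from typing import List, Optional

# Android's default pattern, e.g. IMG_20170312_133700.jpg
ANDROID_NAME = re.compile(r"IMG_(\d{8})_(\d{6})")


def date_from_filename(name: str) -> Optional[datetime]:
    """Parse the creation date from an Android-style filename, if present."""
    match = ANDROID_NAME.search(name)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H%M%S")
    except ValueError:  # digits that do not form a valid date
        return None


def batch_by_time(names: List[str],
                  gap: timedelta = timedelta(minutes=1)) -> List[List[str]]:
    """Group filenames into batches; a pause longer than `gap` starts a new one."""
    dated = []
    for name in names:
        taken = date_from_filename(name)
        if taken is not None:  # files without a parsable date are skipped here
            dated.append((taken, name))
    dated.sort()

    batches: List[List[str]] = []
    previous: Optional[datetime] = None
    for taken, name in dated:
        if previous is None or taken - previous > gap:
            batches.append([])  # first image, or the pause was too long
        batches[-1].append(name)
        previous = taken
    return batches
```

Each inner list then corresponds to one multi-page document, with the pages sorted by the time they were taken.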
- Joining batches of documents to a single PDF using
pdftk infile1.pdf infile2.pdf cat output outfile.pdf
A no-brainer as you can see. The following image compares the input image from my smartphone (left) with the resulting PDF-image (right).
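For reference, the four commands of this step can be chained from Python roughly like this. Again this is a sketch under assumptions and not the actual script: the function names and the intermediate filenames are invented, and I assume that scantailor-cli writes a .tif with the same base name as its input into the output folder. The command-line flags are the ones listed above.

```python
import subprocess
from pathlib import Path
from typing import List


def page_commands(image: Path, workdir: Path) -> List[List[str]]:
    """Build the three per-page commands for one input photo."""
    oriented = workdir / f"{image.stem}_oriented.jpg"  # hypothetical intermediate name
    tif = workdir / f"{oriented.stem}.tif"  # assumed output name of scantailor-cli
    pdf = workdir / f"{image.stem}.pdf"
    return [
        ["convert", str(image), "-auto-orient", "-strip", str(oriented)],
        ["scantailor-cli", "-l=1", str(oriented), str(workdir)],
        ["tiff2pdf", "-p", "A4", "-F", "-o", str(pdf), str(tif)],
    ]


def join_command(pages: List[Path], target: Path) -> List[str]:
    """Build the pdftk call that joins single-page PDFs into one document."""
    return ["pdftk", *(str(page) for page in pages), "cat", "output", str(target)]


def run_all(commands: List[List[str]]) -> None:
    """Run the commands in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Building the argument lists separately from running them makes it possible to print or inspect the commands before anything touches the files.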
Step 4: Publishing the result
After processing is done I want to access the result as soon as possible. I decided to go for email, because I will receive a push notification on my phone once the scanned PDF has arrived, and it’s a backup of the PDF at the same time. Sending out email should work out of the box for any SMTP-based mail host. The only problem I discovered was with Google Mail, because Python’s mail module is not an approved client for Google. In this case you have to allow “less secure apps” to send out email in your name.
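Sending the PDF with Python's standard library could look roughly like the following sketch. The function names, the subject line and the body text are assumptions for illustration, and the sketch assumes a mail host that accepts STARTTLS followed by a login (commonly on port 587); your host may need something different.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender: str, recipient: str,
                  pdf_name: str, pdf_bytes: bytes) -> EmailMessage:
    """Create a mail with the scanned PDF attached."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = f"Scan: {pdf_name}"  # hypothetical subject line
    message.set_content("Your scanned document is attached.")
    message.add_attachment(pdf_bytes, maintype="application",
                           subtype="pdf", filename=pdf_name)
    return message


def send(message: EmailMessage, host: str, port: int,
         user: str, password: str) -> None:
    """Send the message via SMTP, assuming the host supports STARTTLS."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        smtp.login(user, password)
        smtp.send_message(message)
```

Keeping the message construction apart from the sending makes it easy to check the attachment locally before any mail leaves the server.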
The toolchain can be configured via a small JSON file. Currently you need to define a working folder, your email configuration and your Owncloud/Nextcloud credentials. Run the toolchain like
python3 scan2mail.py -i config_1.json -i config_2.json
Notice that you can input multiple configuration files to support multiple users.
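A repeated -i option like this can be handled with argparse's append action. The following is a sketch only: the key names working_dir, email and cloud are hypothetical placeholders for the three settings mentioned above and are not the names used in the real configuration files.

```python
import argparse
import json
from typing import List, Optional

# Hypothetical key names; the real configuration files may use different ones.
REQUIRED_KEYS = ("working_dir", "email", "cloud")


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Accept one or more `-i config.json` options, one per user."""
    parser = argparse.ArgumentParser(description="scan2mail (sketch)")
    parser.add_argument("-i", dest="configs", action="append",
                        required=True, metavar="CONFIG.json")
    return parser.parse_args(argv)


def load_config(text: str) -> dict:
    """Parse one JSON configuration and check that all settings are present."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError("missing settings: " + ", ".join(missing))
    return config
```

The script can then loop over the parsed list of configuration paths and run the whole toolchain once per user.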
For hosting I went for Docker to provide a consistent environment, plus a small shell script that handles building the Docker image, (re)starting the container and locking access. It’s recommended (though not required) to reuse a single container, so that you don’t download your remote image source folder every time you run the toolchain.
To invoke the script I added it to my crontab to be run every minute. The lock file mentioned above prevents the script from running concurrently.
*/1 * * * * basti /scan2mail/docker-run.sh /scan2mail/config-1.json /scan2mail/config-2.json
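In my setup the locking lives in the shell script. If you would rather guard the Python entry point itself, the same idea can be sketched with an advisory file lock. This is an alternative technique and not what the repository does; the function name and the lock path are made up, and fcntl is only available on Unix-like systems.

```python
import fcntl
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def single_instance(lock_path: str) -> Iterator[bool]:
    """Yield True if this run holds the lock, False if another run already has it."""
    handle = open(lock_path, "w")
    try:
        try:
            # Non-blocking exclusive lock: fails at once if it is already taken.
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False
            return
        yield True
    finally:
        handle.close()  # closing the file also releases the lock
```

A cron-triggered run would wrap its work in "with single_instance(path) as acquired:" and simply exit when acquired is False. Because the kernel drops the lock when the process ends, a crashed run cannot leave a stale lock behind.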
If you want to run the script without Docker, make sure to read through the Dockerfile to reproduce the required system.
I tested this setup with my better half and we quickly got rid of a huge Leitz folder, even though we Germans love them so much. Most of the minor quirks are gone and the scanner has now been in productive use for more than a week.
I see room for improvement in the batching, which currently just looks implicitly for timestamp differences greater than one minute. And for invoking the process, I would love to see a reactive solution that starts the process once all the images belonging to a document are uploaded. Performance-wise, the biggest issue currently is the ugly loop that waits for documents to arrive.
Let me know your thoughts on Twitter, try it out, share it, remix it, but in any case..