Before 0.2.17, I looped manually over the image, and for each opaque pixel, I drew a transparent pixel to the target. That was a naïve implementation, mostly copied from A4 Lix. While I used calls to the video hardware, they were many, many calls per tile.
In 0.2.17, I have a fully white, fully opaque copy of the tile in VRAM. For drawing dark, I blit the white piece with a blender that deducts from the target pixels (colors and alpha) the source pixels (colors and alpha).
Now, dark drawing has about the same VRAM performance as a normal blit.
-- Simon