leehalls.net

Python Export for Specific Tagged Headlines & Files in Org Documents

[2022-11-11 Fri] Renamed this article because in my mind “parser” suggests something more than this is. This is just a way to read an org file, extract the headings identified by a specific tag, and copy only the files linked from those headings, not all the files I have stored in the data directory.

The reason this genuine example of poor Python exists is my workflow. I keep project notes in org files (of course!), and any associated files that are linked, e.g. emails or images, are saved in the relevant sub-directory, e.g.

batch code snippet start

- top dir
  - all orgfiles
  - project_dir_1
  - project_dir_2
  - project_dir_3

batch code snippet end

However, if I want to share specific tagged topics, say to hand over part of a project to someone else or perhaps for financial arguments, I can export or publish (in fact I’ve got quite a good publish.el). But whilst this copies only the nodes that are specifically tagged, it also copies every single file stored in the project sub-directory. Not good when some topics are private or covered by an NDA, nor if I want to keep the size down.

I searched and searched but, weirdly, came up empty. I sort of expected this to be a normal requirement, but perhaps that’s the point of the Emacs/Org combination: it is so malleable that we can each create our own curated libraries, so who is to say what is “normal”? Thus I created this monstrosity. I may have missed something, but in the meantime it is here if anyone wants to try it, and if anyone can show me a better way, please do!

What is it?

  • Well, it is a Python program that takes a specified org file, parses it, searches the headings for specific tags, and then outputs to the export directory only the files specifically associated with those tagged nodes.

What is it not good at?

  • It’s crude. The output doesn’t consider sub-headings etc., so every heading is written with the HTML <h1> tag. Images are not resized and have no size attributes applied; in other words, if you’ve a big image in your org file that you resized using the

emacs-lisp code snippet start

#+ATTR_HTML: :width 200
#+ATTR_ORG: :width 200

emacs-lisp code snippet end

attributes … well, that’s ignored.
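One way the width could be carried over, sketched here under assumptions: scan the raw body line by line, remember the most recent `#+ATTR_HTML: :width N` value, and attach it to the image link directly below. `image_widths` is a hypothetical helper, not part of orgparse or org-python, and it only understands the simple `:width N` form shown above.

```python
import re

# Hypothetical helper, not part of orgparse or org-python.
ATTR_RE = re.compile(r'^\s*#\+ATTR_HTML:.*?:width\s+(\S+)', re.IGNORECASE)
LINK_RE = re.compile(r'\[\[file:([^\]]+)\]')

def image_widths(bodytext):
    """Map each file link to the :width given on an ATTR_HTML line just above it."""
    widths = {}
    pending = None  # width seen on the attribute lines directly above
    for line in bodytext.splitlines():
        attr = ATTR_RE.match(line)
        if attr:
            pending = attr.group(1)
            continue
        if line.strip().upper().startswith('#+ATTR_'):
            continue  # other attribute lines (e.g. ATTR_ORG) keep the pending width
        link = LINK_RE.search(line)
        if link and pending is not None:
            widths[link.group(1)] = pending
        pending = None  # any other line ends the attribute block
    return widths
```

The resulting dictionary could then be used to add a width attribute to the matching image tags in the generated HTML; how best to do that depends on what org-python emits, which is not covered here.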

Also, weirdly, some links are not followed by a space, so the continuation text starts immediately after the link. A minor problem, but visually an annoying one.
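A possible workaround for the missing space, as a sketch only: post-process the generated HTML and insert a space wherever a closing `</a>` is followed directly by a word character. `space_after_links` is a hypothetical helper; it patches the output rather than finding out where the space gets lost.

```python
import re

def space_after_links(html):
    """Insert a space after </a> when a word character follows it directly."""
    # The lookahead leaves punctuation such as '.' or ',' untouched.
    return re.sub(r'</a>(?=\w)', '</a> ', html)
```

It could be applied to the string returned by `to_html` before it is written to the output file.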

Does it work?

  • Yes. I’ve tested it on a MacBook Air, copying across one specific project file and its sub-directory (data safety!): a file of 1049 lines with a sub-directory containing >300 MB of files. It copied across only the files that were pertinent and output an HTML file that was, if not nice to look at, usable.

Is it perfect?

  • HELL NO … did you read my site’s intro? I’m an amateur; I do this because I find it interesting, and I could be a lot better at it, but I don’t program frequently. It’s a bit like language usage (I was conversational in German and understood a lot of Japanese, Swedish and others): if you don’t use it, it dies, and you end up remembering only “konnichiwa”, or “Minä rakastan sinua” if you speak any Finnish.

So on to the code:

First, it relies on two libraries: orgparse (https://github.com/karlicoss/orgparse) and an org-to-HTML library, org-python (https://github.com/honmaple/org-python).

Import the libraries needed:

python code snippet start

from orgparse import load, loads
from orgpython import to_html
import shutil
import re
import os

python code snippet end

Next we load the org file whose entries we want to parse:

python code snippet start

root = load('project_340.org')

python code snippet end

Open the output file for writing:

python code snippet start

os.makedirs('export', exist_ok=True)  # open() fails if the export directory is missing
f = open('export/test.html', 'w')

python code snippet end

This is the old function that was called to write the headline and raw body:

python code snippet start

def output(hdg,body):
    # write heading & body to file
    print('creating output')
    f.write(hdg)
    f.write('\n')
    f.write(body)

python code snippet end

It is now replaced with:

python code snippet start

def convert(hdg, bodytext):
    # here be dragons
    # write the heading as <h1>, then the body converted from org to HTML
    f.write('<h1>' + hdg + '</h1>')
    f.write(to_html(bodytext, toc=True, offset=0, highlight=True))

python code snippet end
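Since every heading ends up as `<h1>`, one possible refinement is to pick the tag from the heading's depth. A minimal sketch, assuming the depth is available as an integer (orgparse nodes expose it as `node.level`); `heading_html` is a hypothetical helper, and it also escapes the heading text, which `convert` does not do.

```python
import html

def heading_html(level, text):
    """Return an HTML heading whose tag follows the org heading depth."""
    # HTML only defines <h1> to <h6>, so clamp deeper (or invalid) levels.
    level = max(1, min(level, 6))
    return '<h{0}>{1}</h{0}>\n'.format(level, html.escape(text))
```

To use it, `convert` would need the level passed in as an extra argument, with the result written in place of the hard-coded `<h1>` line.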

A function to identify the filenames inside the body text. Here we split the entire text into individual words and loop through them to find the file links:

python code snippet start

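# old function
def copyfile(filename):
    # here be dragons
    # 'filename' is really the whole body text; check it word by word
    for word in filename.split():
        if 'file:' in word:
            # skip the 7 characters of '[[file:' and stop at the first ']'
            fn = word[7:word.find(']')]
            shutil.copyfile(fn, 'export/'+fn)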

python code snippet end

The problem with the above function is that my Outlook email-save macro currently saves filenames with spaces (I need to rewrite it), so the function splits the filename at the space and I may not get the complete filename, e.g. [[file:this_example would be saved incorrectly.msg]]. The next function avoids the problem by using regex, something which, even after all these years, I still haven’t got to grips with.

python code snippet start

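def copyfile(bodytext):
    # split on runs of square brackets, so a link such as
    # [[file:dir/name with spaces.msg]] yields 'file:dir/name with spaces.msg'
    pattern = re.compile(r'[\[\]]+')
    for part in filter(None, pattern.split(bodytext)):
        # startswith avoids matching ordinary text that merely contains 'file:'
        if part.startswith('file:'):
            src = part[5:]                      # strip the 'file:' prefix
            dest = os.path.join('export', src)  # same relative path under export/
            print('copying', src, '->', dest)
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            shutil.copyfile(src, dest)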

python code snippet end

To recap: the old version splits the body text identified by orgparse into individual words, looks for the keyword file:, skips the first 7 characters of the word to remove the [[file: element, and uses find to locate the closing square bracket, which isolates the filename in the variable fn. The regex version instead splits the body text on the square brackets, so a filename containing spaces stays in one piece, and then strips the file: prefix. Either way, this is repeated for every file link found in the body text, allowing us to copy only the relevant files to our export directory.
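The same extraction can also be sketched with a single regular expression that captures everything between `[[file:` and the first `]`. `find_file_links` is a hypothetical name, and the sample text below is made up for illustration.

```python
import re

# Capture the path of every [[file:...]] link, with or without a description.
FILE_LINK_RE = re.compile(r'\[\[file:([^\]]+)\]')

def find_file_links(bodytext):
    """Return the paths of all file links found in raw org body text."""
    return FILE_LINK_RE.findall(bodytext)

sample = (
    "Mail from supplier: [[file:project_dir_1/this example has spaces.msg]]\n"
    "Drawing: [[file:project_dir_1/plan.png][the plan]] and a plain file: mention.\n"
)
print(find_file_links(sample))
```

Because the pattern requires the literal `[[file:` opening, a stray "file:" in ordinary text is not picked up, and a link description after `][` is left out of the captured path.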

Finally, the main section, which parses the org file. Note that we use format='raw' to get the complete org link:

python code snippet start


# traverse through all the nodes/headings
for node in root[1:]:  # [1:] for skipping root itself
    # choose headings whose tags are exactly the one wanted
    # (a heading carrying any additional tag does not match)
    if node.tags == {'the_tag_i_want'}:
        hdg = node.heading
        bdy = node.get_body(format='raw')
        copyfile(bdy)
        convert(hdg, bdy)

f.close()
print('----')

python code snippet end
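One thing to be aware of in the loop above: comparing `node.tags` with `==` only matches headings whose tag set is exactly that one tag. A small sketch with plain Python sets (standing in for `node.tags`; the tag names are made up) shows the difference from a membership test.

```python
# Plain sets standing in for node.tags; the tag names are illustrative only.
wanted = 'the_tag_i_want'
only_wanted = {'the_tag_i_want'}
wanted_plus_other = {'the_tag_i_want', 'private'}

def exact_match(tags):
    """True only when the heading carries the wanted tag and nothing else."""
    return tags == {wanted}

def has_tag(tags):
    """True when the wanted tag is present, whatever else is there."""
    return wanted in tags

print(exact_match(only_wanted), has_tag(only_wanted))
print(exact_match(wanted_plus_other), has_tag(wanted_plus_other))
```

With the exact match, a heading that is also tagged `private` is skipped, which may well be the safer choice when some topics are under an NDA; the membership test would export it.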