Python Export for Specific Tagged Headlines & Files in Org Documents

October 28, 2022

[2022-11-11 Fri] Renamed this article because, in my mind, “parser” indicates something more than this is. This is just a way to read an org file, extract the headings identified with a specific tag, and copy only the files linked from those headings, not all the files I have stored in the data directory.

The reason this genuine example of poor Python exists is my workflow: I keep project notes in org files (of course!), and any associated files that are linked, e.g. emails or images, are saved in the relevant sub-directory, e.g.

- top dir
  - all orgfiles
  - project_dir_1
  - project_dir_2
  - project_dir_3

However, if I want to share specific tagged topics, say to hand over part of a project to someone else or perhaps for financial arguments, I can export or publish (in fact I’ve got quite a good publish.el). But whilst this copies only the nodes that are specifically tagged, it also copies every single file stored in the project sub-directory. That is not good when some topics are private or covered by an NDA, nor when I want to keep the size down.

I searched and searched but, weirdly, came up empty. I sort of expected this to be a normal requirement, but perhaps that’s the point of the Emacs/Org combination: it is so malleable that we can create our own curated libraries, so who is to say what is “normal”? Thus I created this monstrosity. I may have missed something, but in the meantime it is here if anyone wants to try it, and if anyone can show me a better way, please do!

What is it?

Well, it is a Python program that takes a specified org file, parses it, searches the headings for specific tags, and then copies to the export directory only the files specifically associated with those tagged nodes.

What is it not good at?

It’s crude. The output doesn’t consider sub-headings etc., so every heading gets the HTML <h1> tag. Images are not re-sized and have no size attributes applied; in other words, if you’ve a big image in your org file that you resized using the

#+ATTR_HTML: :width 200
#+ATTR_ORG: :width 200

attributes … well, that’s ignored.

Also, weirdly, some links do not get a trailing space, so the continuation text follows immediately after the link. A minor problem, but visually an annoying one.
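For what it is worth, the width value is still sitting in the raw body text, so it could be recovered. A minimal sketch, assuming the attribute is always written on a single `#+ATTR_HTML:` line as above; the name `find_widths` and the sample image path are made up for illustration, and what to do with the value afterwards is left open:

```python
import re

# Matches lines such as "#+ATTR_HTML: :width 200" and captures the value.
# [ \t] is used instead of \s so a match never runs over a line break.
ATTR_WIDTH = re.compile(
    r'^[ \t]*#\+ATTR_HTML:.*?:width[ \t]+(\S+)',
    re.IGNORECASE | re.MULTILINE,
)

def find_widths(org_text):
    """Return every :width value found on #+ATTR_HTML lines, in order."""
    return ATTR_WIDTH.findall(org_text)

# Hypothetical org snippet, in the shape shown above
sample = (
    "#+ATTR_HTML: :width 200\n"
    "#+ATTR_ORG: :width 200\n"
    "[[file:project_dir_1/diagram.png]]\n"
)
print(find_widths(sample))
```

Only the `#+ATTR_HTML:` line is picked up here; the `#+ATTR_ORG:` line is for display inside Emacs and is skipped on purpose.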

Does it work?

Yes. I’ve tested it on a MacBook Air, working on a copy of one specific project file and its sub-dir (data safety!): a file with 1049 lines and a sub-directory containing >300 MB of files. It copied across only those files that were pertinent and output an HTML file that was, if not nice to look at, usable.

Is it perfect?

HELL NO … did you read my site’s intro? I’m an amateur. I do this because I find it interesting, and I could be a lot better at it, but I don’t program frequently. It is a bit like language usage (I was conversational in German and understood a lot of Japanese, Swedish and others): if you don’t use it, well, it dies, and you end up remembering only “konnichiwa”, or “Minä rakastan sinua” if you speak any Finnish.

So on to the code:

First, it relies on two libraries: the orgparse library, which can be found here: https://github.com/karlicoss/orgparse, and an org-to-HTML library, https://github.com/honmaple/org-python

Import the libraries needed:

from orgparse import load, loads
from orgpython import to_html
import shutil
import re
import os

Next we load the file whose entries we want to parse:

root = load('project_340.org')

Open the output file for writing (the export directory must already exist):

f = open('export/test.html', 'w')

This is the old function that was called to write the headline and raw body:

def output(hdg,body):
    # write heading & body to file
    print('creating output')
    f.write(hdg)
    f.write('\n')
    f.write(body)

It is now replaced with:

def convert(hdg, bodytext):
    # here be dragons
    # every heading is written as <h1>, whatever its level in the org file
    f.write('<h1>' + hdg + '</h1>')
    # let org-python turn the raw org body into HTML
    f.write(to_html(bodytext, toc=True, offset=0, highlight=True))
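Since the heading is always written as `<h1>`, one way the sub-heading problem might be tackled is to choose the tag from the heading's depth. A minimal sketch, assuming the depth is available as an integer (orgparse nodes appear to expose it as `node.level`, but treat that as an assumption to check); `heading_html` is a made-up helper, not part of either library:

```python
import html

def heading_html(level, text):
    """Wrap a heading in <h1>..<h6> according to its outline depth."""
    depth = min(max(level, 1), 6)  # HTML only defines h1 to h6
    # escape &, < and > so heading text cannot break the markup
    return '<h{0}>{1}</h{0}>'.format(depth, html.escape(text))

print(heading_html(1, 'Project 340'))
print(heading_html(3, 'Costs & invoices'))
```

Inside `convert()` this would stand in for the hard-coded `'<h1>' + hdg + '</h1>'`, with the level passed in alongside the heading text.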

A function to identify the filenames inside the body text. Here we split the entire text into individual words and loop through them to find the file links:

# old function
def copyfile(filename):
    # here be dragons
    for word in filename.split():
        if 'file:' in word:
            fn = word[7:word.find(']')]
            shutil.copyfile(fn, 'export/'+fn)

The problem with the above function is that I need to re-write my Outlook email-save macro. Currently it saves filenames with spaces, so the above splits the filename at the space, meaning I may not get the complete filename, e.g. [[file:this_example would be saved incorrectly.msg]]. This next function avoids the problem by using regex, something which, even after all these years, I still haven’t got to grips with.

def copyfile(bodytext):
    # split the raw body on runs of [ and ] so that each link target,
    # e.g. "file:project_dir_1/some mail.msg", arrives as one chunk
    pattern = re.compile(r'[\[\]]+')
    for chunk in filter(None, pattern.split(bodytext)):
        if chunk.startswith('file:'):
            src = chunk[5:]                     # path without "file:"
            dest = os.path.join('export', src)  # same relative path under export/
            print('filename info', dest)
            # recreate the sub-directory before copying into it
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            shutil.copyfile(src, dest)

What the old function effectively does is split the body text identified by orgparse into words, look for the keyword file:, slice off the first 7 characters to remove the [[file: element, and then use find to locate the closing square bracket, so isolating the filename in the variable fn. The regex version instead splits the body text on runs of square brackets, so each link target, spaces and all, arrives as one chunk beginning with file:. Either way, this is repeated for every file link found in the body text, allowing us to copy only the relevant files to our export directory.

Finally, the main section, parsing the org file. Note we use format='raw' to get the complete org link:
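To see the difference between the two approaches without touching any files, here is a small sketch that only extracts the paths. The sample text and the function names `links_by_word` and `links_by_bracket` are invented for illustration, and the word-based version is written in the spirit of the old function rather than copied from it:

```python
import re

# Hypothetical body text with one filename containing a space
sample = ('See [[file:project_dir_1/this example.msg][the email]] '
          'and [[file:project_dir_1/plan.png]]')

def links_by_word(text):
    """Split on whitespace first, so paths with spaces get cut short."""
    found = []
    for word in text.split():
        if 'file:' in word:
            start = word.find('file:') + len('file:')
            end = word.find(']', start)
            found.append(word[start:end] if end != -1 else word[start:])
    return found

def links_by_bracket(text):
    """Split on runs of [ and ], keeping each link target whole."""
    chunks = filter(None, re.split(r'[\[\]]+', text))
    return [c[len('file:'):] for c in chunks if c.startswith('file:')]

print(links_by_word(sample))
print(links_by_bracket(sample))
```

The word-based version loses everything after the space in the first filename, while the bracket-based version returns both paths complete.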


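# traverse through all the nodes/headings
for node in root[1:]:  # [1:] skips the root node itself
    # choose headings whose tags match exactly: a heading that carries
    # any additional tag is skipped by this comparison
    if node.tags == {'the_tag_i_want'}:
        hdg = node.heading
        bdy = node.get_body(format='raw')  # raw keeps the full [[file:...]] links
        copyfile(bdy)      # copy only the files linked from this heading
        convert(hdg, bdy)  # write the heading and body to the HTML output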

f.close()
print('----')
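Which headings get picked up depends on how the tags are compared. A small sketch with plain Python sets standing in for `node.tags`; the heading names and the extra tags are made up:

```python
wanted = 'the_tag_i_want'

# Hypothetical headings and their tags, standing in for node.tags
headings = {
    'Handover notes': {'the_tag_i_want'},
    'Budget': {'the_tag_i_want', 'finance'},
    'Private': {'nda'},
}

# Exact comparison: the heading must carry this tag and nothing else
exact = [h for h, tags in headings.items() if tags == {wanted}]

# Membership test: the heading may carry other tags as well
contains = [h for h, tags in headings.items() if wanted in tags]

print(exact)
print(contains)
```

The exact comparison is the stricter of the two, which may be what you want when some headings carry a second tag marking them as private. The membership test is the one to reach for if headings with several tags should be exported too.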
