Formula Tokenizer

A few years ago, I wrote a Formula Formatter add-in to present a long formula for easier reading. It does it through a process called Tokenizing, which is basically a process of putting the components of a formula into labelled boxes.
That add-in was compatible with Excel 2003 formulas, but Excel 2007 brought us extra formulas which meant my add-in was incomplete.

It was the first time I’d ever developed a tokenizer, so after a while I thought of better ways of doing it.
From my Oracle days, I recalled those diagrams in SQL manuals. Kind of like bubbles with lines leading in and leading out detailing the syntax of a statement. At the time I figured I could reuse this technique to document Excel Formulas. Of course, now that I’ve done a bit of reading, I’ve learned these are called Syntax Diagrams, and that’s exactly what they were intended for… visualising the syntax of formulas, or really, visualising BNF. BNF itself is a way of documenting a programming language’s grammar.

After some weeks of ripping apart formulas, I completed a BNF document describing Excel Formulas.
Probably the very next day, I stumbled across this document on Microsoft’s website: Excel Binary File Format (.xls) Structure Specification. It’s got BNF galore.
I felt like the guy who spent a weekend learning to program his VCR without a manual, only for the manual to turn up the next day.

So, with syntax in hand, I developed a tokenizer, which was quite a lot of fun.
It does Excel 2007 Tables, which was really my goal from the start. I also tidied up the bits around external named references.
It should work with International versions, but I use English only.
Ron de Bruin gave me some advice in this area (thanks Ron!), but still feel there might be problems with errors such as #VALUE!. You’ll find them defined once at the top of the module if you want to play.

Now that I’m finished, I’m sharing… You can download the code here (a zip file at 75 KB)
I’m not really interested in developing a user interface for formula formatter this time around. I’m happy to leave that to anyone else.

I wrote a proof of concept userform. In the image below, I’ve doubleclicked the IF function, and the whole IF, including it’s content, is highlighted.
tokenizer

12 Comments

  1. Conrad Brits says:

    Rob, I tried to down load the 2005 AudXL app without success! Is it still available?

  2. Rob van Gelder says:

    You can find it on my website:
    http://vangelder.orcon.net.nz/excel/

  3. Kanwaljit says:

    Rob,
    But I found it at the following link. Is there any update after that ?
    http://vangelder.orconhosting.net.nz/excel/audxl.html

  4. Conrad Brits says:

    Rob, that is the link that I used but, when clicking through, it changed to http://vangelder.orconhosting.net.nz/excel/ with a Google Trivia link and the rest of the page being blank. Something I’m doing wrong here or a setting in IE on my PC?
    The link Kanwaljit has did work though.
    Regards.

  5. Rob van Gelder says:

    Sorry, I’m better at Excel than at making webpages!
    Kanwaljit’s link is what you want. That was the last version I made.

  6. Ben Norris says:

    Hi Rob,

    I’ve recently been playing around with your original tokeniser (from AudXL) and have found it to be excellently written, fast and very useful. I’ve only just found this article where you introduce the new version. To be honest, I’ve not really used the original code in XL2007 much so hadn’t noticed any issues with it in that respect, but am keen to use the new code as I’ll doubtless be using XL2007/2010 more going forward.

    Two quick points:

    1. In the original version, the parser incorrectly recognised a cell with the text “Unit(s)” in it (note: not entered as a formula, just as a text label) as a function called Unit. This was pretty easy to fix but I thought you might like to be aware of it. I don’t think the new version has the same problem. It happens with any text with an open bracket after it.

    2. In the old version you had a specific token type for ‘Operand Reference Named Range’ which seems to have gone from the new version – both cell references and named ranges are returned as token type ‘Reference’. I quite liked having the ability to distinguish between a reference to a cell and a reference to a named range so I was thinking of adding some code to do this. In your original tokeniser you wrote a function called ‘IsReferenceA1′ to check if a reference was to a cell or to a named range and I was thinking of re-writing this for XL2007 (to take account of the Big Grid). Before I do that I was wondering if you’d already tried that and abandoned it as not being possible, or if there was some other reason you left it out.

    Many thanks for offering this great code to the community – it’s been hugely inspiring, educational and useful!

    Rgds,

    Ben

  7. “Now that I’m finished, I’m sharing…”
    Thanks Rob.
    I started using BN back in the early seventies, but have been away for a while.
    I felt a need to write a formula parser, and you popped up in a search.
    Your comment above reminds me of the spirit back in the days of 80-column punched cards ….

    Here you are: Enjoy a free copy of the indexer!
    http://www.Indxr.ca

  8. Rob van Gelder says:

    A bit late in replying, sorry.

    Ben: I decided not to go as far as distinguishing between cell references and named references. I don’t think it would be too hard to work out, but there’s the added problem of 2007 versus 2003 names.
    In 2003 ABC123 is a name, but in 2007 it’s a cell.

    Chris: pleased it’s of use to someone. cheers

  9. Ben Norris says:

    Hi Rob,

    I’ve started using the tokeniser code in anger recently and noticed an ‘interesting’ memory issue that I can’t seem to resolve…

    If you add a sheet to the Parser.xls file with about 40,000 formulae on it (just to give you something to work with) and run the tokeniser on each cell in turn then the memory used by Excel shoots up dramatically (an increase of over 120MB once all cells are processed). FYI I’m running Excel 2010 and the test code I ran looks like this:

    Sub test()

    Dim rngT As Range
    Dim tknT As Tokens

    For Each rngT In Sheet23.UsedRange
        If Left(rngT.Formula, 1) = “=” Then
            Set tknT = ParseFormula(rngT.Formula)
        End If
        Set tknT = Nothing
        DoEvents
    Next rngT

    End Sub

    As you can see I’m killing the tknT variable after each call of ParseFormula and as such wouldn’t expect the memory to rise as much as it is. The other odd thing is that once the code has completed running, if you click on the ‘Reset’ button in the VBE (looks like a Stop button next to the Play and Pause buttons) Excel seems to release all the memory that it has stored up (it takes a while for it to do this though and Excel hangs whilst doing it).

    I can’t seem to see any global variables that are used in the Tokeniser so am really at a loss to understand what’s going on. Any ideas?!

    Many thanks,

    Ben.

  10. Ben Norris says:

    Just found the following and wonder if it might have something to do with it:

    http://www.vbi.org/Items/article.asp?id=106

    Will have a look at the tokeniser code and see if it falls into that trap. My gut feel is that it does…

    Rgds,

    Ben

  11. Ben Norris says:

    Ok, so I’ve found the issue and it is pretty much as described on the link above. The Class ‘Token’ contains a Public member called ‘Parent’ that is of type Class Tokens, thus creating the circularity described in the above article.

    If you remove this member (and all references to it elsewhere in the code) then the memory issue is resolved. Note that in my application I have removed all of the code relating to the example form Rob included so the ‘Parent’ member is actually redundant anyway.

    I’ve also added code at the end of the main ParseFormula procedure to set all local objects to Nothing, as well as code to do the same upon class_terminate for each of the classes.

    Hope someone finds the above useful!

    Ben

  12. Rob van Gelder says:

    Hi Ben.

    Thanks for identifying the bug!
    That wasn’t very clever of me – naturally it would memory leak.

    There is a way around it, as commented by Rob Bruce here.

    Cheers

Leave a Reply