A method for roughly estimating the token count for ChatGPT

The way large language models like ChatGPT encode text is pretty interesting. The models don't receive the text character by character; instead, frequent character sequences are encoded as a single token. A token can be a single character, part of a word, a whole word, or even a rather long sequence like ##############. The exact tokens depend on the model. For the current versions of ChatGPT they are listed in the cl100k_base.tiktoken file, which contains lines like

IHByZXZpb3VzbHk= 8767
Qkk= 8768
IFNlY3JldA== 8769

which is, if you undo the base64 encoding (here is the token file decoded):

" previously" 8767
"BI" 8768
" Secret" 8769
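
Undoing the encoding needs nothing beyond the standard library. Here is a minimal sketch that parses the three sample lines above; the helper name parse_tiktoken_line is made up for illustration, and it assumes every line has the form "base64 token, space, number". Printing with repr makes the leading spaces visible, which are part of the tokens.

```python
import base64


def parse_tiktoken_line(line: str) -> tuple[bytes, int]:
    """Split one 'base64 number' line into (raw token bytes, token id)."""
    encoded, rank = line.split()
    return base64.b64decode(encoded), int(rank)


sample = """IHByZXZpb3VzbHk= 8767
Qkk= 8768
IFNlY3JldA== 8769"""

for line in sample.splitlines():
    token, rank = parse_tiktoken_line(line)
    # repr() shows the leading space that belongs to some tokens
    print(repr(token), rank)
```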

Any text that is served to the model is then matched against that table: the longest character sequence starting at the current position that is contained in the table becomes the next token. That is, " previously Secret" would be the tokens 8767 8769.
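
The lookup described here can be sketched in a few lines. This is only the simplified longest-match idea with a three-entry toy table, working on strings instead of bytes; the real tokenizer behind cl100k_base works on bytes and builds tokens through byte-pair merges, so its results can differ. The function name greedy_tokenize is made up for this example.

```python
def greedy_tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Repeatedly take the longest vocabulary entry that matches at the current position."""
    max_len = max(len(token) for token in vocab)
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest possible piece first, then shorter ones.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in vocab:
                ids.append(vocab[piece])
                pos += length
                break
        else:
            raise ValueError(f"no token matches at position {pos}")
    return ids


# Toy table with just the three sample entries (note the leading spaces).
toy_vocab = {" previously": 8767, "BI": 8768, " Secret": 8769}
print(greedy_tokenize(" previously Secret", toy_vocab))  # [8767, 8769]
```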

Now, that would be unimportant if ChatGPT weren't as tight on context as it currently is: in many models 4000 tokens (or 32000 in the newest models, hooray!) is all the context it can process, including the response. So you often have to estimate how many tokens a text has before submitting it, and maybe abbreviate it.

In a serious application you'd use a library like tiktoken or JTokkit for that. But in small scripts or mini applications, like many of those I put in my ChatGPT toolset, you wouldn't want to import a library or munch through that 1.8MB of data just to quickly run a script. So I was wondering whether it's possible to estimate the token count of a text a bit more easily. OpenAI mentions that a token is about 3/4 of a word in English plain text on average, but that gets pretty inaccurate in languages with long words like German, and is pretty much nonsense for source code. After a while I found a way which is at least somewhat better.

The basic idea is to ask for each character how much of a token it makes up. That question doesn't make too much sense, but it leads us to something that comes reasonably close. I collected a couple of types of text and source code from my computer and wrote something to determine, for each character, the average of 1/{token length} over the tokens that contain that character, and put characters with close values into one group:

spaces that are following a space: 0.081
NORabcdefghilnopqrstuvy and single space : 0.202
CHLMPQSTUVfkmspwx : 0.237
-.ABDEFGIKWY_rtz{ü : 0.304
!$&(/;=JX`jn}ö : 0.416
"#%)*+56789<>?@Z[]^|§«äç’ : 0.479
,01234:~Üß and characters > 255 : 0.658
other characters: 0.98
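
Why would such numbers add up to a token count at all? A token of length n hands 1/n to each of its n characters, so every token contributes exactly 1 in total. A rough sketch of the per-character statistic, assuming the text has already been split into tokens by a real tokenizer and that the average is taken over every occurrence of a character (the name char_weights is made up; the grouping step and the special case for repeated spaces are left out):

```python
from collections import defaultdict


def char_weights(tokens):
    """Average of 1/len(token) over all occurrences of each character."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for token in tokens:
        if not token:
            continue
        share = 1.0 / len(token)  # each character carries an equal part of its token
        for ch in token:
            totals[ch] += share
            counts[ch] += 1
    return {ch: totals[ch] / counts[ch] for ch in totals}


# Toy input: pretend "aba" was tokenized as "ab" + "a".
weights = char_weights(["ab", "a"])
print(weights)
# Adding up the weights of all characters gives back the number of tokens.
print(sum(weights[ch] for ch in "aba"))
```

On the texts the statistic was computed from, the sum matches the token count exactly; the error comes from merging characters into groups and from applying the numbers to other texts.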

If you just go through each file and add those numbers for all the characters, that comes reasonably close to the real token count:

file type | real token count vs. guesstimate
css 123491 vs. 103405
html 232691 vs. 243483
java 671616 vs. 757334
js 838884 vs. 825870
md 60583 vs. 59638
xml 912672 vs. 857563
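
For reference, a minimal sketch of such an estimator built from the groups listed above. It is not the code from my script, and it rests on a few assumptions: the letters s, p, r, t and n show up in two groups in the list as printed, so the sketch uses the first group that lists them; tabs and newlines are counted as "other characters"; and explicitly listed characters such as ’ take precedence over the "characters > 255" rule.

```python
# (characters, weight) pairs copied from the list above; the single space is in the first group.
GROUPS = [
    ("NORabcdefghilnopqrstuvy ", 0.202),
    ("CHLMPQSTUVfkmspwx", 0.237),
    ("-.ABDEFGIKWY_rtz{ü", 0.304),
    ("!$&(/;=JX`jn}ö", 0.416),
    ("\"#%)*+56789<>?@Z[]^|§«äç’", 0.479),
    (",01234:~Üß", 0.658),
]

# Assumption: if a character is listed in two groups, the first one wins.
WEIGHTS = {}
for chars, weight in GROUPS:
    for ch in chars:
        WEIGHTS.setdefault(ch, weight)


def estimate_tokens(text: str) -> float:
    """Sum the per-character weights to guess the token count of text."""
    total = 0.0
    prev = ""
    for ch in text:
        if ch == " " and prev == " ":
            total += 0.081  # space following a space
        elif ch in WEIGHTS:
            total += WEIGHTS[ch]
        elif ord(ch) > 255:
            total += 0.658  # characters > 255
        else:
            total += 0.98  # other characters
        prev = ch
    return total


print(round(estimate_tokens("Hello, world!")))
```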

Of course, that is pretty specific to my text- and codebase. Do you have better ideas?

In case you care: that estimation function and the statistics-creating function are in my token counting script. (The implementation of real token counting in there is embarrassingly inefficient, but was quick enough for my current purposes. That's algorithmically pretty interesting, but alas, there are already lots of good libraries for that. :-)

UPDATE: I just discovered that my tokenization implementation isn't quite complete, as there is a merging process that I didn't implement, so you might want to take this with one or rather two grains of salt. I don't have the time to fix this right now, but I think it's interesting enough to leave it around anyway.