Regex for Japanese text letter counting

Japanese text can be thought of a mixture of Chinese characters known as Kanji, and the two phonetic alphabets of Hiragana and Katakana. Japanese text has no white spaces between words and one Kanji character can stand in for a whole word. This makes it difficult calculate how to fairly change clients for Japanese to English translation work. In Japan, the loosely agreed procedure is to charge by the letter, but when charging for every single letter, there are some characters that must be removed before calculating the total per letter fee. Some examples are English letters, numbers, white spaces, carriage returns, periods and other symbols in both Japanese and English. Removing these characters changes the text below from 135 characters

Revision History:
1.2.6.      2015-1-29

to 72 characters.


A significant difference when the translation work is priced by letters.

So how exactly do you remove all the cruft for the calculation? You could combing through the text yourself and manually delete letters one by one but Ain't Nobody Got Time for That! So regex to the rescue. If the text is in Unicode the following regex will match any characters, including symbols, that are not Kanji, Katakana, or Hiragana.


Use it in a text editor that has a regex friendly find and replace, like TextMate and done.

find and replace with regex