Friday, September 5, 2014

Non-printable characters

Some tips for getting rid of non-printable characters, as often seen in text files created on Microsoft platforms.

To see those characters you can use something like:

$ sed -n 'l' Genesis.txt 
[...]
6 \266 And God said, Let there be a firmament in the midst of the wat\
ers, and let it divide the waters from the waters.$
[...]

or with other standard commands (the output differs depending on the tool):

$ grep "Let the waters under "  Genesis.txt 
9 � And God said, Let the waters under the heaven be gathered together [...]
$ cat -v Genesis.txt   | grep "Let the waters"
9 M-6 And God said, Let the waters under the heaven be gathered together unto one place, and let the dry [land] appear: and it was so.^M

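The \266 in the sed output is the octal code of the offending byte (0xB6, which cat -v shows as M-6); in Latin-1 it is the pilcrow sign (¶). To list all the special characters in the Latin-1 set see: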

$ man iso_8859-1

(Note that 'man ascii' only lists the original 7-bit character set.)
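
As a quick cross-check of the octal value, the sketch below converts \266 to decimal and hexadecimal and then decodes the byte as Latin-1. It assumes a POSIX-style printf and that iconv is installed; neither is part of the commands used elsewhere in this post.

```shell
# Octal 266 as decimal and hexadecimal (a leading 0 marks an octal constant for printf).
printf '%d 0x%X\n' 0266 0266

# Decode the single byte as ISO-8859-1 and print it as UTF-8, followed by a newline.
printf '\266\n' | iconv -f ISO-8859-1 -t UTF-8
```

The first command prints 182 0xB6; on a UTF-8 terminal the second should display the pilcrow sign.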

You can remove that byte with sed using its octal code (the \oNNN escape is a GNU sed extension; -i.bak edits the file in place and keeps a backup of the original as Genesis.txt.bak):

$ sed -i.bak 's/\o266//g' Genesis.txt
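
A minimal sketch of the same removal on a throwaway file, so the effect can be checked without touching Genesis.txt. The file name sample.txt is made up for illustration, GNU sed is assumed (for \o266 and -i), and LC_ALL=C is an added precaution so the tools treat the data as raw bytes rather than as possibly invalid multibyte text.

```shell
# Build a small sample file (hypothetical name) with a 0xB6 byte and CRLF line endings.
printf '6 \266 And God said\r\n7 And God made\r\n' > sample.txt

# Keep the raw byte in a variable so grep can search for it.
byte=$(printf '\266')

# Delete every 0xB6 byte in place; the untouched original is kept as sample.txt.bak.
LC_ALL=C sed -i.bak 's/\o266//g' sample.txt

# Count the lines containing the byte: 1 in the backup, 0 in the edited file.
# (grep exits non-zero when the count is 0, hence the "|| true".)
LC_ALL=C grep -c "$byte" sample.txt.bak || true
LC_ALL=C grep -c "$byte" sample.txt || true
```

The carriage returns are deliberately left alone here; only the 0xB6 byte is removed.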

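To remove the ^M (carriage return) there are other methods. Note that dos2unix needs -n to write to a new file (without it, every file named on the command line is converted in place), and that strings is lossy: it only keeps runs of at least four printable characters, so shorter lines are dropped.

$ dos2unix -n Genesis.txt  Genesis_fixed.txt
$ strings Genesis.txt > Genesis_fixed.txt
$ tr -d $'\r'  <  Genesis.txt > Genesis_fixed.txt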

or just use sed. GNU sed understands \r; with other versions, type Ctrl+V then Ctrl+M inside the pattern to insert a literal carriage return (displayed as ^M) instead:

$ sed -i.bak 's/\r$//' Genesis.txt
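
To confirm that the carriage returns are really gone, one can count the lines that still contain one. The sketch below does this on a throwaway file (crlf_sample.txt and crlf_fixed.txt are made-up names) using the tr method; printf produces the carriage return so the check also works in shells without $'\r' quoting.

```shell
# Two lines with DOS (CRLF) line endings in a hypothetical sample file.
printf 'line one\r\nline two\r\n' > crlf_sample.txt

# A literal carriage return to search for.
cr=$(printf '\r')

# Lines containing a carriage return before the fix: 2.
grep -c "$cr" crlf_sample.txt

# Strip the carriage returns, writing the result to a new file.
tr -d '\r' < crlf_sample.txt > crlf_fixed.txt

# Lines containing a carriage return after the fix: 0.
# (grep exits non-zero when nothing matches, hence the "|| true".)
grep -c "$cr" crlf_fixed.txt || true
```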

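A more general approach found on the internet is to delete every byte except the ASCII octal values listed in the command: tab (\11), newline (\12), carriage return (\15) and the printable range from space to tilde (\40-\176). Leave out \15 to drop the carriage returns too. Note that tr only reads standard input, so the file has to be redirected into it:

$ tr -cd '\11\12\15\40-\176' < Genesis.txt > Genesis_fixed.txt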
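
A small sketch of the whitelist idea on a made-up sample line containing a tab, a 0xB6 byte and a CRLF ending. Here \15 is left out of the set on purpose, so that carriage returns are deleted as well; the file names are hypothetical.

```shell
# One line with a tab, a Latin-1 pilcrow byte (0xB6) and a DOS line ending.
printf 'a\tb \266 c\r\n' > mixed_sample.txt

# Keep only tab (\11), newline (\12) and printable ASCII (\40-\176); delete everything else.
tr -cd '\11\12\40-\176' < mixed_sample.txt > mixed_fixed.txt

# Dump the result byte by byte: a, tab, b, two spaces, c and newline remain.
od -c mixed_fixed.txt
```

Because the filter deletes bytes rather than replacing them, the two spaces that surrounded the pilcrow end up next to each other.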

EOF