def dayCoder(self):: Hex Dump

I'm off to Paris next week for a meeting. This has prompted me to investigate a minor bug in some code that was displaying "Â§" instead of "§". It's really just a cosmetic defect until a customer sees it. At that point, it makes me look like a careless dick.

My immediate thought was to look at the hex values in the file. The last time I did this was when I was using VAX/VMS systems. That was at least 6 years ago, and possibly as many as 10. 'hexdump' at the OS X command line does the job, but not quite in the way I wanted. I didn't bother to read the man pages, and decided to roll my own in Python instead:

#!/usr/bin/env python
# encoding: utf-8

import sys

WIDTH = 24

# §

def ascii(byte_value):
    if byte_value < 32:
       return {0:  'NUL',
               1:  'SOH',
               2:  'STX',
               3:  'ETX',
               4:  'EOT',
               5:  'ENQ',
               6:  'ACK',
               7:  'BEL',
               8:  'BS ',
               9:  'TAB',
               10: 'LF ',
               11: 'VT ',
               12: 'FF ',
               13: 'CR ',
               14: 'SO ',
               15: 'SI ',
               16: 'DLE',
               17: 'DC1',
               18: 'DC2',
               19: 'DC3',
               20: 'DC4',
               21: 'NAK',
               22: 'SYN',
               23: 'ETB',
               24: 'CAN',
               25: 'EM ',
               26: 'SUB',
               27: 'ESC',
               28: 'FS ',
               29: 'GS ',
               30: 'RS ',
               31: 'US ',}[byte_value]
    elif byte_value < 127:
        return ' %s '%chr(byte_value)
    else:
        return '%03d'%byte_value
        
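def hex_dump_line(bytes, index=0, width=WIDTH):
    # Print one row of hex values with the matching characters underneath.
    # Every cell is three characters wide so the two rows line up.
    if len(bytes)>0:
        hex_part = u""
        text_part = u""
        for byte_value in (ord(byte) for byte in bytes):
            hex_part += '%02X '%byte_value
            text_part += ascii(byte_value)
        print '%04X %s'%(index,hex_part)
        print '     %s\n'%(text_part)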
            
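def hexdumpfile(filename, width=WIDTH):
    # Read the file one byte at a time and flush a row every `width` bytes.
    file_to_dump = open(filename, "rb")
    byte = file_to_dump.read(1)
    bytes = []
    index = 0
    while byte != "":
        bytes.append(byte)
        if len(bytes)==width:
             # index still points at the last byte of this row, so full
             # rows are labelled with their end offset, not their start.
             hex_dump_line(bytes,index=index,width=width)
             bytes = []
        byte = file_to_dump.read(1)
        index +=1
    # Flush whatever is left over as a final, shorter row.
    hex_dump_line(bytes, index=index, width=width)
    file_to_dump.close()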

if __name__=="__main__":
   # sys.argv[0] is the path of this script, so it dumps its own source.
   hexdumpfile(filename=sys.argv[0],
               width=WIDTH)

The output looks like:

0017 23 21 2F 75 73 72 2F 62 69 6E 2F 65 6E 76 20 70 79 74 68 6F 6E 0A 23 20 
      #  !  /  u  s  r  /  b  i  n  /  e  n  v     p  y  t  h  o  n LF  #    

002F 65 6E 63 6F 64 69 6E 67 3A 20 75 74 66 2D 38 0A 0A 69 6D 70 6F 72 74 20 
      e  n  c  o  d  i  n  g  :     u  t  f  -  8 LF LF  i  m  p  o  r  t    

0047 73 79 73 0A 0A 57 49 44 54 48 20 3D 20 32 34 0A 0A 23 20 C2 A7 0A 0A 64 
      s  y  s LF LF  W  I  D  T  H     =     2  4 LF LF  #    194167LF LF  d
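
The script above is Python 2 (note the print statements). For anyone on Python 3, here is a rough sketch of the same idea. It is not a drop-in port: the names cell and dump_rows are my own inventions for illustration, it reads the whole file in one go, and it labels each row with its start offset rather than its end offset, so the numbers in the left-hand column will differ from the output shown above.

```python
import sys

WIDTH = 24

# Names for the 32 ASCII control characters, each padded to three columns.
CONTROL_NAMES = [
    "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
    "BS ", "TAB", "LF ", "VT ", "FF ", "CR ", "SO ", "SI ",
    "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
    "CAN", "EM ", "SUB", "ESC", "FS ", "GS ", "RS ", "US ",
]


def cell(byte_value):
    """Return a three-character label for one byte (hypothetical helper)."""
    if byte_value < 32:
        return CONTROL_NAMES[byte_value]
    if byte_value < 127:
        return " %s " % chr(byte_value)
    # DEL and anything above ASCII is shown as a three-digit decimal number.
    return "%03d" % byte_value


def dump_rows(data, width=WIDTH):
    """Yield (hex_row, text_row) string pairs for a bytes object."""
    for start in range(0, len(data), width):
        chunk = data[start:start + width]
        # Iterating over bytes in Python 3 yields integers, so no ord() needed.
        hex_row = "%04X " % start + "".join("%02X " % b for b in chunk)
        text_row = "     " + "".join(cell(b) for b in chunk)
        yield hex_row, text_row


def hexdumpfile(filename, width=WIDTH):
    """Print a two-row-per-line dump of the named file."""
    with open(filename, "rb") as file_to_dump:
        data = file_to_dump.read()
    for hex_row, text_row in dump_rows(data, width):
        print(hex_row)
        print(text_row + "\n")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumes the file to dump is given as the first command-line argument.
    hexdumpfile(sys.argv[1])
```

Splitting the formatting (dump_rows) from the printing (hexdumpfile) is just a convenience so the rows can be inspected without touching a file, for example list(dump_rows(b"# \xc2\xa7\n")).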

This allowed me to see that the '§' was encoded in the file as the two bytes C2 A7, which is the correct UTF-8 encoding of U+00A7. Read one byte per character as Latin-1, though, C2 is 'Â' and A7 is '§', and sure enough that's the 'Â§' I was seeing. When I open the file in TextMate, I see just the '§', but in Safari's View Source, I see 'Â§'. It turns out that although I was encoding correctly as UTF-8, I'd neglected to declare the encoding in the HTML, so the browser fell back to a single-byte encoding. This fixes the problem:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
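
The mix-up is easy to reproduce away from the browser. This is a small Python 3 sketch, assuming the browser's fallback behaves like Latin-1 for these two bytes; misread_as_latin1 is a made-up helper name, not anything from a library.

```python
def misread_as_latin1(text):
    """Encode text as UTF-8, then decode the bytes as if they were Latin-1."""
    return text.encode("utf-8").decode("latin-1")


section_sign = "\u00a7"  # the section sign character

# UTF-8 stores U+00A7 as two bytes.
utf8_bytes = section_sign.encode("utf-8")
hex_view = " ".join("%02X" % b for b in utf8_bytes)
print(hex_view)  # C2 A7

# Treating each byte as its own character gives two characters:
# U+00C2 (A with circumflex) followed by U+00A7 (the section sign).
garbled = misread_as_latin1(section_sign)
print([hex(ord(c)) for c in garbled])  # ['0xc2', '0xa7']

# Reversing the mistake recovers the original single character.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == section_sign)  # True
```

Nothing in the file is wrong in this scenario. The bytes are the same either way, and only the reader's assumption about the encoding changes, which is why declaring the charset is enough to fix it.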

I'm not sure when I'll need to look inside a file again. Not for this issue, I hope: next time I'll recognise it and check the header first.

EDIT: I should've just looked in the app store. File Viewer is free and does pretty much what I wanted to do - quicker, better, prettier, etc.


Thursday, 17 May 2012
