My immediate thought was to look at the hex values in the file. The last time I did that was on VAX/VMS systems, at least 6 and possibly 10 years ago. 'hexdump' at the OS X command line does the job, but not immediately in the way I wanted. I didn't bother to read the man pages, and decided to roll my own in Python instead:
    #!/usr/bin/env python
    # encoding: utf-8

    import sys

    WIDTH = 24

    # §

    def ascii(byte_value):
        # Every byte is rendered as exactly three characters so that the
        # text row lines up under the hex row.
        if byte_value < 32:
            return {0: 'NUL', 1: 'SOH', 2: 'STX', 3: 'ETX', 4: 'EOT',
                    5: 'ENQ', 6: 'ACK', 7: 'BEL', 8: 'BS ', 9: 'TAB',
                    10: 'LF ', 11: 'VT ', 12: 'FF ', 13: 'CR ', 14: 'SO ',
                    15: 'SI ', 16: 'DLE', 17: 'DC1', 18: 'DC2', 19: 'DC3',
                    20: 'DC4', 21: 'NAK', 22: 'SYN', 23: 'ETB', 24: 'CAN',
                    25: 'EM ', 26: 'SUB', 27: 'ESC', 28: 'FS ', 29: 'GS ',
                    30: 'RS ', 31: 'US ',}[byte_value]
        elif byte_value < 127:
            return ' %s '%chr(byte_value)
        else:
            # DEL and anything outside ASCII is shown as a decimal value.
            return '%03d'%byte_value

    def hex_dump_line(bytes, index=0, width=WIDTH):
        if len(bytes)>0:
            hex_part = u""
            text_part = u""
            for byte_value in (ord(byte) for byte in bytes):
                hex_part += '%02X '%byte_value
                text_part += ascii(byte_value)
            print '%04X %s'%(index,hex_part)
            print '     %s\n'%(text_part)

    def hexdumpfile(filename, width=WIDTH):
        file_to_dump = open(filename, "rb")
        byte = file_to_dump.read(1)
        bytes = []
        # index counts the bytes read so far, so a full row is labelled
        # with the offset of its last byte (0017 for the first row).
        index = 0
        while byte != "":
            bytes.append(byte)
            if len(bytes)==width:
                hex_dump_line(bytes,index=index,width=width)
                bytes = []
            byte = file_to_dump.read(1)
            index +=1
        # Print whatever is left over as a final, shorter row.
        hex_dump_line(bytes, index=index, width=width)

    if __name__=="__main__":
        # sys.argv[0] is this script's own path, so the script dumps itself.
        hexdumpfile(filename=sys.argv[0], width=WIDTH)

The output looks like:
    0017 23 21 2F 75 73 72 2F 62 69 6E 2F 65 6E 76 20 70 79 74 68 6F 6E 0A 23 20
          #  !  /  u  s  r  /  b  i  n  /  e  n  v     p  y  t  h  o  n LF  #

    002F 65 6E 63 6F 64 69 6E 67 3A 20 75 74 66 2D 38 0A 0A 69 6D 70 6F 72 74 20
          e  n  c  o  d  i  n  g  :     u  t  f  -  8 LF LF  i  m  p  o  r  t

    0047 73 79 73 0A 0A 57 49 44 54 48 20 3D 20 32 34 0A 0A 23 20 C2 A7 0A 0A 64
          s  y  s LF LF  W  I  D  T  H     =     2  4 LF LF  #    194167LF LF  d
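For anyone on Python 3, here is a minimal sketch of the same dump format. It is not the script I ran: the names `CONTROL_NAMES`, `cell`, `dump_lines` and `hexdump` are made up for this sketch, it takes bytes already in memory rather than a filename, and it labels each row with the offset of its first byte rather than its last.

```python
# Python 3 sketch of the dump format above (all names are illustrative).
CONTROL_NAMES = [
    'NUL', 'SOH', 'STX', 'ETX', 'EOT', 'ENQ', 'ACK', 'BEL',
    'BS ', 'TAB', 'LF ', 'VT ', 'FF ', 'CR ', 'SO ', 'SI ',
    'DLE', 'DC1', 'DC2', 'DC3', 'DC4', 'NAK', 'SYN', 'ETB',
    'CAN', 'EM ', 'SUB', 'ESC', 'FS ', 'GS ', 'RS ', 'US ',
]


def cell(byte_value):
    """Return a three-character label for one byte."""
    if byte_value < 32:
        return CONTROL_NAMES[byte_value]
    if byte_value < 127:
        return ' %s ' % chr(byte_value)
    # DEL and non-ASCII bytes are shown as a three-digit decimal value.
    return '%03d' % byte_value


def dump_lines(data, width=24):
    """Yield (offset, hex_part, text_part) for each row of `data`."""
    for start in range(0, len(data), width):
        row = data[start:start + width]  # iterating bytes yields ints
        hex_part = ''.join('%02X ' % b for b in row)
        text_part = ''.join(cell(b) for b in row)
        yield start, hex_part, text_part


def hexdump(data, width=24):
    """Print each row as a hex line followed by an aligned text line."""
    for offset, hex_part, text_part in dump_lines(data, width):
        print('%04X %s' % (offset, hex_part))
        print('     %s\n' % text_part)


if __name__ == '__main__':
    # Dump the comment line that caused the trouble.
    hexdump('# \u00a7\n'.encode('utf-8'))
```

To dump a file, something like `hexdump(open(path, 'rb').read())` should do, with `path` being whatever file you want to inspect.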
This allowed me to see that the '§' was encoded in the file as C2 A7, which is its two-byte UTF-8 encoding. Sure enough, taken one byte at a time, C2 is the Unicode value of 'Â' and A7 is the Unicode value of '§'. When I open the file in TextMate, I see just the '§', but in Safari's View Source, I see 'Â§'. It turns out that although I was encoding correctly as UTF-8, I'd neglected to declare the encoding in the HTML, so the browser was reading each byte as a separate character. This fixes the problem:
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
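The mismatch can be reproduced in a few lines of Python 3. This is only a sketch: it assumes the browser fell back to Latin-1 (Windows-1252 would give the same two characters), which I haven't confirmed, and the variable names are just for illustration.

```python
import unicodedata

# The section sign is a single code point, U+00A7.
section = '\u00a7'

# UTF-8 stores it as two bytes: C2 A7.
utf8_bytes = section.encode('utf-8')

# Assumption: the reader falls back to Latin-1, where every byte is one
# character. The two bytes then become two separate characters.
misread = utf8_bytes.decode('latin-1')

# Decoding with the declared (correct) encoding recovers the original.
correct = utf8_bytes.decode('utf-8')

print(' '.join('%02X' % b for b in utf8_bytes))
print([unicodedata.name(c) for c in misread])
print([unicodedata.name(c) for c in correct])
```

The second line of output lists two character names and the third lists one, which is the difference between what Safari and TextMate showed me.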
I'm not sure when I'll need to look inside a file again. Not for this issue, I hope. I'll recognise it and check the header first.
EDIT: I should've just looked in the App Store. File Viewer is free and does pretty much what I wanted to do - quicker, better, prettier, etc.