I’m coming to the conclusion that there’s actually no such thing as “plain data”; it always has some metadata attached. If it doesn’t, it might be displayed incorrectly, and then a human needs to intervene to determine the correct metadata to apply to fix the problem. (Example: View → Character Encoding in Firefox.) Pushed to the extreme, even “just numbers” have metadata: they can be encoded as text, a binary integer/float (IEEE 754 or otherwise) of some size/endianness, or an ASN.1 encoding.
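A minimal sketch of the "just numbers" point, in Perl: the same four bytes read under four different pieces of metadata. The byte values and variable names are made up for illustration, and the float reading assumes a platform where Perl's single-precision float is IEEE 754.

```perl
use strict;
use warnings;

# Four bytes with no metadata attached.
my $bytes = "\x41\x42\x43\x44";

# Each reading below supplies different metadata for the same bytes.
my $as_text  = $bytes;                 # ASCII text
my $as_be    = unpack('N', $bytes);    # 32-bit unsigned integer, big-endian
my $as_le    = unpack('V', $bytes);    # 32-bit unsigned integer, little-endian
my $as_float = unpack('f>', $bytes);   # single-precision float, big-endian (perl 5.10+)

print "text:          $as_text\n";
print "big-endian:    $as_be\n";
print "little-endian: $as_le\n";
print "float:         $as_float\n";
```

Nothing in the bytes themselves says which of the four answers is the right one; that knowledge has to travel alongside them.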
Another conclusion I’m reaching is that HTTP conflates all kinds of metadata. Coupled with the lack of self-contained metadata in file formats and filesystems, things start to accumulate hacks.
Thursday, January 26, 2012
Wednesday, January 25, 2012
What If: Weak Memory Pages
Raymond Chen wrote about the "what if everybody did this?" problem of applications written to consume up to some threshold of memory and free some of it under pressure: if multiple applications have different thresholds that they're trying to maintain, then the one with the smallest-free threshold wins. Of course, the extreme of this is a normal application that doesn't try to do anything fancy, which acts like it has a negative-infinity threshold. If it never adjusts its allocations in response to free memory, then it always wins.
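To make the "smallest threshold wins" argument concrete, here is a toy simulation. It is only a sketch: the 100-unit machine, the two application names, and the rule that each application grows or shrinks by the full difference between free memory and its own threshold are all assumptions for illustration, not anything from Raymond Chen's post.

```perl
use strict;
use warnings;

my $total     = 100;                            # units of memory in the toy machine
my %held      = (polite => 0,  greedy => 0);    # memory each app currently holds
my %keep_free = (polite => 30, greedy => 10);   # free memory each app tries to preserve

# One adjustment step: grow while there is more free memory than the app's
# threshold, give memory back when there is less.
sub adjust {
    my ($app) = @_;
    my $free  = $total - $held{polite} - $held{greedy};
    my $delta = $free - $keep_free{$app};
    if ($delta > 0) {
        $held{$app} += $delta;
    } elsif ($delta < 0) {
        my $give_back = -$delta;
        $give_back = $held{$app} if $give_back > $held{$app};   # cannot free more than held
        $held{$app} -= $give_back;
    }
}

for my $round (1 .. 10) {
    adjust($_) for qw(polite greedy);
    print "round $round: polite=$held{polite} greedy=$held{greedy}\n";
}
```

Under these assumptions the application that insists on keeping 30 units free is squeezed down to nothing, while the one that only wants 10 free ends up holding everything else.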
Some of the solutions batted around in the comment thread involve using mmap() or other tricks to try to get the OS to manage the cache, but this brings up its own problems.
Wednesday, January 18, 2012
Perl and Unicode in Brief
Perl requires a knob for every I/O, and expects you to set them all correctly yourself. By default, they're all off (Unicode-unaware) for backwards compatibility.
1. If you want to handle Unicode and avoid The Unicode Bug, in which your strings sometimes act like they aren't actually Unicode: in perl 5.12+, use feature 'unicode_strings';. For older perl, see Unicode::Semantics, or use utf8::upgrade by hand. The latter two achieve their task by forcing "the UTF-8 flag" on for the string.
2. If you want strings in your source text with non-ASCII: save it as a utf-8 encoded file and use utf8;. Or you can encode Unicode code points with hex-escapes, \xae → ®, or \x{30ab} → カ. There are technically other options, which have additional drawbacks (utf-16 breaks the #! line; latin-1 is restricted to latin-1 unless you decode it yourself).
3. If you want to print to a UTF-8 aware environment like your terminal emulator or CGI STDOUT after issuing a Content-Type: text/html; charset=utf-8 header: setting UTF-8 on the filehandle with binmode(STDOUT, ':utf8') is the minimum, but :encoding(utf-8) instead of :utf8 makes stricter guarantees that real code points are coming out.
4. If you want to read a UTF-16 encoded document into a Unicode string with minimal fuss: open(FH, '< :encoding(utf-16)', $name). Note that the document has to be correctly encoded. You can use the Encode module's decode function if you need finer control over error behavior, but that's naturally more fuss:

    use Encode;
    open(FH, '<:raw', $name) or die "Cannot open $name: $!";
    # slurp the raw bytes; splitting UTF-16 on "\n" bytes would cut characters in half
    my $bytes = do { local $/; <FH> };
    my $text = decode('utf-16', $bytes, $POLICY);   # decode(ENCODING, OCTETS, CHECK)
    for my $line (split /^/, $text) { ... }

5. If you want to convert a Unicode string to a specific set of bytes for some encoding-unaware module to throw on the wire, use the encode function from the Encode module:

    use Encode;
    $message->attr('content-type.charset', 'utf-16');
    $message->data(encode("UTF-16", $body));

   (This example would be for MIME::Lite, if you're curious.)
6. If you want to read a file encoded with charset X into a string encoded with charset Y, I've found no instant way to do this. It's probably best to pass the input-encoding along as the output-encoding if at all possible. But if you can't, you might find the Encode module's from_to(), or string-IO as in IO::File->new(\$out, '>:'), or maybe a whole PerlIO filter as in PerlIO::code helpful.
7. If you see "Wide character in ..." warnings, then you passed a string with code points >=0x100 to something that expected a byte string of some sort: either really latin-1, or an encoded string.
8. If you see longer strings of gibberish where you expected sensible non-ASCII characters, then you have probably double-encoded, either literally, or by printing an encoded string to a filehandle which does encoding.
9. If you see the Unicode replacement character in a stream that should be UTF-8, you haven't encoded at all, such as printing a byte string on a raw filehandle in an environment expecting UTF-8. Most likely, the filehandle should have an encoding set on it, per point #3 above, though that may cause #8 on other strings you've printed.
10. If you are using modules, they each may or may not deal with Unicode. DBD::mysql has the mysql_enable_utf8 option; Email::MIME accepts encoded strings via body, and decoded ones through body_str, but for the latter, you must also set the charset and encoding attributes (which correspond to the charset of Content-Type, and the Content-Transfer-Encoding, respectively). MIME::Lite does not handle decoded strings at all and hopes for the best.
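A few of the points above can be checked in a short script. This is a sketch under stated assumptions rather than a recipe: the sample string, the variable names, and the in-memory filehandle standing in for STDOUT are all illustrative choices, and only the Encode functions named in the list (encode, decode, from_to) are used.

```perl
use strict;
use warnings;
use Encode qw(encode decode from_to);

my $text = "\x{30ab}";                   # a Unicode string holding one code point

# Encoding once gives the bytes a UTF-8 aware consumer expects.
my $once = encode('UTF-8', $text);       # 3 bytes
my $back = decode('UTF-8', $once);       # back to a 1-character string

# Encoding the already-encoded bytes again is the "gibberish" case:
# each byte is treated as a Latin-1 character and encoded separately.
my $twice = encode('UTF-8', $once);      # 6 bytes

# A filehandle with an encoding layer does the encoding for you.
my $buf = '';
open(my $fh, '>:encoding(UTF-8)', \$buf) or die "Cannot open in-memory handle: $!";
print {$fh} $text;                       # print the decoded string, not $once
close($fh);

# Converting bytes from charset X to charset Y in place with from_to().
my $converted = encode('UTF-16LE', $text);
from_to($converted, 'UTF-16LE', 'UTF-8');

printf "once=%d bytes, twice=%d bytes, via filehandle=%d bytes\n",
    length($once), length($twice), length($buf);
```

Printing $once to $fh instead of $text would reproduce the double encoding, which is why it pays to keep track of whether a given string is decoded text or encoded bytes.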
Friday, January 13, 2012
A nice vim highlighting hack
I wanted to highlight places where control flow could be redirected in my perl code, so I hacked up my personal colorscheme file to highlight Exceptions specifically:
hi Exception ctermfg=white ctermbg=blue
Now, I just needed to define the things I wanted highlighted as Exception*. Thus, the newly added ~/.vim/after/syntax/perl.vim:
" flow control highlighting
syn keyword perlStatementCtlExit return die croak confess last next redo
syn keyword perlStatementWarn warn carp cluck
hi link perlStatementCtlExit Exception
hi link perlStatementWarn Statement
" and i'm tired of everything being yellow
hi link perlStatementStorage Define
The last line isn't related to the above, but it recolors my/local/our in Preprocessor Blue instead of Statement Yellow. They do, after all, affect the state of the compiler at parse time.
* This means that I'm going to open a non-Perl file sometime and weird things will have Exception highlighting. Nobody notices the subtle differences when it's all Statement colored by default.
Wednesday, January 11, 2012
Layer 7 Routing: HTTP Ate the Internet
In the beginning was TCP/IP, and the predominant model was that servers would listen for clients using a pre-established port number. Then came Sun RPC, in which RPC servers were established dynamically, and listened on semi-random ports (still, one port per service provided); the problem was solved by baking the port mapper into the protocol. The mapper listens on a pre-established port, and the client first connects there to inquire, "On what port shall I find service X?"
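The port mapper's job can be modeled in a few lines. This is an in-process toy, not a network service: the function names and the registered port numbers are hypothetical, though the real port mapper does listen on port 111, keys its table on RPC program and version numbers, and answers 0 for a program that is not registered.

```perl
use strict;
use warnings;

my $PORTMAP_PORT = 111;    # the one pre-established port clients must know

my %port_for;              # "program/version" => port the server chose

# Called by an RPC server after it binds to whatever port it was given.
sub register_service {
    my ($program, $version, $port) = @_;
    $port_for{"$program/$version"} = $port;
}

# Called on behalf of a client asking "on what port shall I find service X?"
sub lookup_port {
    my ($program, $version) = @_;
    return $port_for{"$program/$version"} // 0;    # 0 means "not registered"
}

register_service(100003, 3, 2049);     # NFS program number; port is illustrative
register_service(100005, 3, 40123);    # mountd program number; port is made up

my $nfs_port = lookup_port(100003, 3);
my $missing  = lookup_port(999999, 1);
print "ask port $PORTMAP_PORT first; NFS v3 is on port $nfs_port\n";
print "unregistered program maps to port $missing\n";
```

The client needs only one fixed number, the mapper's own port; everything else is discovered at run time.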
Then came HTTP, the layer 6 protocol masquerading as layer 7.