I'm writing this because I spent too much time puzzling over it until I found a few gems. I probably won't use this IRL because it's clunky (and I think I have a way I prefer). but here's ACTUALLY how to write a JSON-serializable class, and why.

The official JSONEncoder docs say: "To use a custom JSONEncoder subclass (e.g. one that overrides the default() method to serialize additional types), specify it with the cls kwarg; otherwise JSONEncoder is used." Less-obvious (at least to me): JSONEncoder can *only* encode native Python types (list, dict, int, string, tuple ... that sort of thing); if you want to encode any other class, you *must* specify cls= in the dump/load methods, and you must provide a JSONEncoder subclass.

Specifically, json.dumps(data, cls=CustomClass) (where CustomClass is a subclass of JSONEncoder, with a default(self, o) method:

class SpecialDataClass(JSONEncoder):
    def default(self, obj):
        if isinstance(obj, self.__class__):
            return { 'data': obj.data } # or whatever serializable native type

Serialize with: json.dumps(specialData, cls=SpecialDataClass)

This "works," but it broke down (for me) when I wanted to have a class stored within a dict, and I wanted to serialize that dict. Like, json.dumps(dict) fails if dict contains a special class. The class doing the serialization has to know what it's serializing (so it can specify that keyword), but I don't want that level of binding in my code. Generally this will fail if you have >1 special, unrelated classes to deserialize. Yes you can write a custom deserializing class which handles all of them (sort of a Moderator), but c'mon.

What I did instead (and what will probably change): I provided my to-be-serialized class with "serialize" and "deserialize" methods; they return or take simple JSON-friendly data. The enclosing class (PersistentDict) now accepts a "cls=" argument. New values in that dict would be instantiated in this class, and on read/write all values are pre-serialized or post-deserialized for sending down to json's dump/load methods.

Working on my cluster-backups I wanted an RPC-like mechanism for communication from client -> server, but I required decoupled operation of the clients and server. I wanted to "magically" serialize data of arbitrary length, simplify the logic (abstracting out checking for error conditions etc), but if possible maintain long-lived TCP connections between client and server.

Above all else, code using datagrams should be very natural, readable, and very easy to use.

A simple "echo" server:

from datagram import *

s = DatagramServer("localhost", 5005)

while True:
    with s.accept() as datagram:
        message = datagram.value().upper()

The client:

from datagram import *

buffer = "a" * 1000

with Datagram(buffer, server="localhost", port=5005) as datagram:
    if datagram.send(): # or send(server="localhost", port=5000)

The datagramwill evaluate True or False based on the most recent connection. if datagram is great. datagram.value() returns its deserialized contents (which is either what was most recently sent or received). len() and contains/in treats it like a list, and it can return iterables like for token in datagram or if token in datagram

I have a handful of smallish machines in my home; the 'always on' systems include an iMac (about 25% used of 1TB drive), a few rpis (about 25-50% used of 2-4TB storage, give/take), and a mac mini (50% used of 4TB RAIDZ2 + 2TB Time Machine). "Traditional" backups involve slurping all the important data to a (single) redundant location, and traditionally requires that the single location be as large as the sum of storage in the network. So backups here would require yet another machine and another 8T drive; or I would have to host that drive on an existing machine and deal with the potential irritation of restoring that machine if/when it fails.

This seems silly; in aggregate my network is less than half utilized, so it should be able to back itself up, right? BUT my mini has a few filesystems with over 1T used... so I would have to split those up and back them up to different machines around the network, keep track of all that, and pull it back if/when I have to restore. That's silly. And OBTW my net reliability approaches zero as I spread subsets-of-backups over more hosts: single host failure would almost certainly cause loss of some portion of the backup.

Quick recap of things I'd want:

  1. not-another-machine to store copies of my data
  2. not a single, massive FS on which to store them
  3. no SPOF (backup host)
  4. no "where did the files go" tracking

Cluster filesystems are available to pool storage around the network; a bunch of smallish filesystems can contribute their storage to look like one big filesystem. Unfortunately, as above I have at least 4 different OSs and I think 5-6 flavors. FORTUNATELY they're all UNIX-like (Windows was excised in 2004).

What I want is a cluster backup solution: NO extra machine to store data, NO massive FS, NO single point of failure, and NO juggling of "this can fit over there (as long as it doesn't grow). This should be a clever (not smart) and dynamic system. I would also love the ability to keep N copies of data spread across my M machines, and I DO NOT want to care which of those machines are up when it comes time to restore.

So yeah, I got bored and made one. It doesn't mangle files (keeping them easy to verify, easy to restore) but it also doesn't require I have any single storage location as large as any single dataset. It is dynamic, will live on free storage in the network, and requires minimal configuration. I can even specify a minimum number of replicas per "source", and the system will maintain at least those; if there are two replicas, I can survive *any* single-host failure.

Ex: say a source filesystem has 10 files, 0.txt ... 9.txt, of 100GB each (1TB total). I have 3 hosts on the network with 700GB available each (total: 2.1TB available). A traditional system won't work here. The dynamic system, however, will self-organize and may come up with a solution like: host A: {0-7}.txt; host B: {3-9}.txt; host C: {0,1,2,8,9,4,7}.txt. Any single failure still leaves a complete replica on the network, and every file exists (at least) twice.

Implementation details: it's a simple client/server architecture, many:many. The configuration specifies source filesystems, backup locations, space (or reservation) for backups, and some options like "TCP port" or "number of replicas." That's it. Adding a new source (or backup host) is a one-line config change. The configuration is also backed up, so the change gets made once and propagates to all participating machines.

The "Server" is relatively dumb, passive, and really acts like a locking mechanism. Clients will request files, and can ask status ("are any files underserved" / "are any files overserved"). Clients are more clever (but ignorant of one another): they will try to copy underserved files, greedily. If they are over-capacity, they'll try to drop overserved files. There is no client-client communication, and no server-initiated client communication.

The upside is a very simple dynamic system which stores N copies of data over M machines; the downside, there may be some over-use of the network while clients settle in to a stable convergence. If, for instance, one (large) file is served to 4 clients initially, 2 of them may choose to drop it later as other files need coverage. This results in the lost network traffic and storage IO for the two extra copies. In practice this seems pretty minimal, the system tends to converge quickly and with some almost-trivial locking in the server, "efficiency" approaches 100%.

Copies are performed with rsync (see above: all UNIX-like). It handles checksumming on copy, and can smartly transmit partial data for changed files. The clients and server will exchange checksums (server never volunteers -- it only confirms). For IO sanity, checksums can be sampled (and the clients+servers can rate-limit their own IO).

Metadata communication is over TCP, and could be encrypted with minimal effort. It's all client->server including the likely case where the client and server are on the same host.

It's a "teach myself nontrivial Python" project. I have mixed feelings about the language, tho more positive than when I started.

I am moving to a new hosting provider because the old one is raising rates and annoyed me. This was a good excuse to upgrade a lot of software too.

If anything got lost, This email address is being protected from spambots. You need JavaScript enabled to view it.. The old stuff is available at http://austindavid.com/jm/ (new provider, old content) or http://hm.austindavid.com/jm/ (old provider)

Note: this was written for Joomla 1.x and is WAY stale; I use Joe's Word Cloud

A simple content plugin which generates a "keyword cloud" at the end of every article.

Keywords are selected from the article's metadata, comma-separated keywords. They are then sized 1-5 (definitions in css) based on the relative hit frequency in the keywords in all viewable articles.

Each keyword is linked out to a search page for that keyword.

This plugin has been validated against W3C XHTML/1.0 Transitional and CSS 2.1
Licensed under GPL2, and don't be a jackass. If you use it, This email address is being protected from spambots. You need JavaScript enabled to view it.

About this: it will first check if a number is "probably prime" using gmp_prob_prime(). It will then spend up to 1s testing sums of powers, printing any as appropriate.

GCF (greatest common factors) are calculated using my own, probably somewhat / not terribly inefficient algorithm.

There exists a virtual hat, from which one can draw names.
The idea: you add a bunch of names to the hat, and each of those people (virtually) draw a unique name from it. This can be used for a "secret santa" game or similar. Check it out, click an ad.