Monday, March 5, 2012

The joy of JAX-B and SAX

I decided to post this after an interesting journey building an XML parsing component that would let me produce really good validation messages, both during deserialization and when post-processing for the kinds of rules you can't express in a schema. Typically, if I want good error/validation messages with location information, I reach for SAX, the Simple API for XML, and build a ContentHandler to handle the events from the SAX parser. In Java the parser calls the handler's setDocumentLocator method to supply a Locator object, which lets me determine the line/column in the XML content at any point in my handler.
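As a rough illustration of the SAX approach described above, here is a minimal sketch using only the JDK. The class name LocatingHandler and the "name@line" output format are my own inventions for the example, not part of the parser discussed in this post. Note that, per the Locator documentation, the reported position is that of the first character after the markup that triggered the event, so it is the end of the start tag rather than its beginning.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: records the line on which each element starts.
final class LocatingHandler extends DefaultHandler {

    private Locator locator;
    private final List<String> seen = new ArrayList<>();

    // The parser hands us its Locator before any other event is delivered.
    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    // The locator is only valid during a callback, so read it here.
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        int line = (this.locator == null) ? -1 : this.locator.getLineNumber();
        this.seen.add(qName + "@" + line);
    }

    List<String> getSeen() {
        return this.seen;
    }

    // Convenience entry point: parse a string and return "element@line" entries.
    static List<String> locate(String xml) {
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            XMLReader reader = spf.newSAXParser().getXMLReader();
            LocatingHandler handler = new LocatingHandler();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.getSeen();
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse XML", ex);
        }
    }
}
```

Calling LocatingHandler.locate with a small document returns one entry per element in document order, which is enough to see why a hand-written handler per document type quickly becomes tedious.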

So, what's the problem? Well, two things: I cannot use this generically, as each content handler has to be coded to the type I want to deserialize, and SAX handlers can get complex to write and maintain. So, what about JAX-B, which provides a nice simple API to deserialize XML directly into a Java object model? The problem is that while the implementation does usually provide location information for deserialization errors, not all errors are caught by JAX-B, so we need to post-process, and by that time we no longer have any location information, just our object model.

To gather up all the errors generated by JAX-B we need to provide an implementation of ValidationEventHandler and pass it to our unmarshaller using the setEventHandler method. Our implementation is pretty simple (and a commonly documented pattern): it only records each validation event into a list, ensuring that the handleEvent method always returns true, because returning false tells JAX-B to treat the error as fatal and stop unmarshalling. Note that we actually transform the JAX-B ValidationEvent into our own ValidationError class, more on this later.

So, now that we have captured all the validation events, we need some way to keep the line/column information for our object model so that when we do additional validation/consistency checking we can provide good, meaningful errors. Ideally we want to record the location information outside the object model; we don't want to have to add a line and column field to each model object. Our solution is simple: we keep a Map of parsed object to Location so that we know the line/column of the XML element that was used to deserialize the object.

But how? After all, JAX-B is pretty much a black box: we ask it to deserialize an input source and we give it the expected type of the root object. For a start, JAX-B does allow you to peek into the process via the setListener method, which takes an instance of the Listener class and allows you to take action on beforeUnmarshal and afterUnmarshal events. More interestingly, JAX-B has a mechanism whereby you can retrieve from your Unmarshaller the internal SAX handler it uses and then use the SAX API to drive the deserialization. The advantage of this is that we can wrap the JAX-B handler in our own handler before passing it to the SAX parser, intercepting any SAX events we want. In fact we only need to intercept the setDocumentLocator call and keep a reference to the Locator object.

The resulting class, DelegatingHandlerImpl, extends the JAX-B Listener and implements the JAX-B UnmarshallerHandler (which in turn extends the SAX ContentHandler). The code for our handler is shown below; by combining the beforeUnmarshal callback with the SAX locator we can build a map that tracks the location of each parsed object.
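
import java.util.Map;

import javax.xml.bind.Unmarshaller.Listener;
import javax.xml.bind.UnmarshallerHandler;

import org.xml.sax.Locator;

class DelegatingHandlerImpl extends Listener implements UnmarshallerHandler {

    private final UnmarshallerHandler unmarshallerHandler;
    private Locator locator;
    private final Map<Object, LocationImpl> locationMap;

    /**
     * Wraps the JAX-B handler and shares the caller's location map.
     */
    DelegatingHandlerImpl(UnmarshallerHandler unmarshallerHandler, Map<Object, LocationImpl> locationMap) {
        this.unmarshallerHandler = unmarshallerHandler;
        this.locationMap = locationMap;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public void setDocumentLocator(Locator locator) {
        // Keep our own reference, then pass the locator on to JAX-B
        this.locator = locator;
        this.unmarshallerHandler.setDocumentLocator(locator);
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public void beforeUnmarshal(Object target, Object parent) {
        super.beforeUnmarshal(target, parent);
        if (target != null && this.locator != null) {
            // Record where the parser currently is for this new object
            this.locationMap.put(target, new LocationImpl(this.locator.getLineNumber(), this.locator.getColumnNumber()));
        }
    }

    // The remaining ContentHandler methods and getResult() simply delegate
    // to this.unmarshallerHandler; they are omitted here for brevity.
}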

The parser code is shown below. It is more complex than you would expect if you are used to SAX, and certainly more than you would expect of JAX-B, but mostly we're just setting up the JAX-B/SAX combination and plugging in our two additional classes, the DelegatingHandlerImpl and the ValidationEventHandlerImpl. We construct both the list of errors (events) and the object-to-location map (locationMap) at the top of the method so that we can make them available to the caller after parsing is complete.

    public T parse(final InputSource input, final String schemaPath, final Class<T> classOfT) 
            throws ParserConfigurationException, IOException {
        this.result = null;
        this.events = new LinkedList<ValidationError>();
        this.locationMap = new HashMap<Object, LocationImpl>();
        try {
            
            // Standard JAX-B
            final JAXBContext context = JAXBContext.newInstance(classOfT);
            final Unmarshaller unmarshaller = context.createUnmarshaller();
            // Setup schema validation if required
            if (schemaPath != null) {
                final SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
                final InputStream schemaIS = ClassLoader.getSystemResourceAsStream(schemaPath);
                final Schema schema = sf.newSchema(new StreamSource(schemaIS));
                unmarshaller.setSchema(schema);
            }
            // Now retrieve the SAX handler that JAX-B uses
            final UnmarshallerHandler unmarshallerHandler = unmarshaller.getUnmarshallerHandler();
            // Wrap it in our own handler
            final DelegatingHandlerImpl actualHandler = new DelegatingHandlerImpl(unmarshallerHandler, this.locationMap);
    
            // Now create and add an error handler
            final ValidationEventHandlerImpl errorHandler = new ValidationEventHandlerImpl(this.events);
            unmarshaller.setEventHandler(errorHandler);
            // Add a listener for before/after unmarshall events
            unmarshaller.setListener(actualHandler);
    
            // Now setup SAX
            final SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            
            // Start the SAX parser but using *our* new handler
            final XMLReader xmlReader = spf.newSAXParser().getXMLReader();
            xmlReader.setContentHandler(actualHandler);
            xmlReader.parse(input);

            // Retrieve the result from the handler, note that this is actually
            // the bridge back to JAX-B
            this.result = (T)unmarshallerHandler.getResult();
            
        } catch (UnmarshalException ex) {
            // ignore, these are reported in the validation errors.
        } catch (JAXBException ex) {
            this.events.add(new ValidationErrorImpl(Severity.FATAL, "JAX-B configuration exception", ex));
        } catch (SAXException ex) {
            this.events.add(new ValidationErrorImpl(Severity.FATAL, "SAX error parsing the InputSource", ex));
        }
        
        return this.result;
    }
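
To show what the location map buys us once parsing is done, here is a minimal sketch of a post-processing check, using only the JDK. Everything in it is hypothetical: Location stands in for LocationImpl, Booking stands in for a model class that JAX-B has populated, and the rule itself is made up. One deliberate difference from the parser above is the suggested use of an IdentityHashMap: beforeUnmarshal fires before the object's fields are populated, so if a model class overrides hashCode based on its fields, a plain HashMap may not find the entry again later.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a post-processing rule that reports locations.
final class PostValidationSketch {

    /** Stand-in for the LocationImpl class used by the parser. */
    static final class Location {
        final int line;
        final int column;

        Location(int line, int column) {
            this.line = line;
            this.column = column;
        }
    }

    /** Made-up model object, as the unmarshaller might have populated it. */
    static final class Booking {
        final int startDay;
        final int endDay;

        Booking(int startDay, int endDay) {
            this.startDay = startDay;
            this.endDay = endDay;
        }
    }

    /** Creates a map keyed by object identity rather than equals/hashCode. */
    static Map<Object, Location> newLocationMap() {
        return new IdentityHashMap<>();
    }

    /** A cross-field rule that a schema cannot easily express. */
    static List<String> check(List<Booking> bookings, Map<Object, Location> locations) {
        List<String> errors = new ArrayList<>();
        for (Booking booking : bookings) {
            if (booking.endDay < booking.startDay) {
                Location at = locations.get(booking);
                String where = (at == null)
                        ? "unknown location"
                        : "line " + at.line + ", column " + at.column;
                errors.add(where + ": endDay " + booking.endDay
                        + " is before startDay " + booking.startDay);
            }
        }
        return errors;
    }
}
```

The model classes stay free of line and column fields, and any rule written after the fact can still point the user at the offending element.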

I've posted all the source for the parser implementation as well as some basic (for now) test cases on Google Code in the validating-jaxb project.

Friday, February 3, 2012

Amazon S3, more than a blob store

Having built a number of RESTful services, I have implemented the storage for resources several times, in several different ways. It's important to realize that there is a progression the first few projects go through until you work out the patterns that really work. The easy ones are obvious: keep the HTTP methods simple, avoid adding crazy request/response headers, use ETags, and if you need content negotiation use only the standard Accept/Content-Type mechanism. The fun starts when you try to wrap your head around updates to resources. The easiest way out of the spiralling discussion of partial updates, support for PATCH and all manner of other craziness is to treat all resources as technically immutable, so that every PUT to a resource creates a new version of it. This works well because the version identifier becomes a very stable ETag, and with a little effort you can expose a URI for each version, allowing you to look back and forward through the history of a resource. For a lot of projects I've also used the resource store for temporary resources (sometimes as a queue: the output from process one is put into the store to be picked up by process two), and it's annoying to keep having to write yet another "reaper" process to remove expired resources.
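
The immutable-versions idea can be sketched in a few lines of plain Java. This is an in-memory toy under my own assumptions, not code from any of those projects: the class name VersionedStoreSketch, the "v1", "v2" version identifiers and the quoted ETag format are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory store: every PUT appends a new immutable version.
final class VersionedStoreSketch {

    // Resource URI -> bodies in the order they were PUT (index 0 is version 1).
    private final Map<String, List<String>> versions = new HashMap<>();

    /** Stores a new version and returns its ETag; nothing is overwritten. */
    String put(String uri, String body) {
        List<String> history = this.versions.computeIfAbsent(uri, key -> new ArrayList<>());
        history.add(body);
        return etag(history.size());
    }

    /** Returns the latest version of the resource, or null if it is unknown. */
    String get(String uri) {
        List<String> history = this.versions.get(uri);
        return (history == null) ? null : history.get(history.size() - 1);
    }

    /** Returns a specific (1-based) version, or null if it does not exist. */
    String get(String uri, int version) {
        List<String> history = this.versions.get(uri);
        if (history == null || version < 1 || version > history.size()) {
            return null;
        }
        return history.get(version - 1);
    }

    /** The version number doubles as a stable ETag value. */
    static String etag(int version) {
        return "\"v" + version + "\"";
    }
}
```

Because a version never changes once written, its ETag never needs to be recomputed, and a per-version URI is just a lookup by version number.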

Recent changes to the Amazon S3 service have really made it a viable resource store; it's not just an online file store any more. S3 has long supported metadata associated with objects and a great API for querying, but it's the recent additions to the API that make the difference:

  • Versioning, it's been there a while actually but not many people use it. Basically you can turn versioning on bucket-by-bucket to store all versions of every object in the bucket. The query API is extended so that, as well as listing all objects in a bucket, you can list all the versions of an object.
  • Expiration, you can now attach expiration rules to a bucket (matched by key prefix) and S3 will automatically remove the matching objects once they reach the configured age.
  • Multi-Delete, if you still have to perform your own clean-up operation you can now query for all "expired" or otherwise superfluous resources and delete a set of them (up to 1,000 keys) in a single API call to S3.
This makes S3 a much more complete solution, and I certainly hope I don't have to write another RESTful resource store any time soon, at least not anything more than a domain wrapper around S3.