Entity Type1 Block Encoding

Introduction

For a long time we have used XML storage in the Blobs for many of the Entities in Sakai. This has given us flexability and the ability to add properties to entities without the need for schema changes. Provided we have added sufficient supporting columns for searching, sorting and aggregation, it is a relatively efficient way of storing and retrieving unstructured data from the database. However where we have not provide those columns is causes problems, and the cost of parsing the XML to load the entity into memory is excessive and unacceptable. Some recent work in this area tries to address this.

Encoding

XML parsing may be great for inter change but even the fastest XML parsers cannot compete for binary byte[] serializations. The Type1 Block Encoding is an encoding that serializes the properties of an entity to a byte[] using the standard DataOutputStream. The Type1 relates to the format of the data block, and the block relates to the block structure of the data block.

The data block is encoded in a byte[] and start with a byte sequence (BLOB_ID) that identifies the entity type of the block. The byte sequence for content collections is the unicode single bytes "CHSBCE" and "CHSBRE" for content resources. The intended encoding is that the first 3 characters denote the service and the next 3 denote the entity.

The remainder of the datablock is a byte[] containing byte values.

The datablock is read and written with DataOutputStream and DataInputStream. The first Int is (readInt()) is the Type of the block. A parser should only attempt to read data blocks that match its type. The next int is the Block ID within that type, after that each Block is defined by the implementation of the reader or writer although I have tried to share block ID's in the same service. So CHS Resources and CHS Collections have different BLOB_ID's but for the Same Type ID (Type 1) the blocks are encode in the same way. The order of the blocks should be assumed to be arbritary and where entities are hierarchically structured blocks may be nested. For instance many entities have Properties. So ContentResources put their Properties into BLOCK_ID=4. This block is then processed by ResourceProperties Serializer class that has its own internal TYPE_ID and BLOCK_ID's. This nesting of blocks enables us to replicate the hierarchical structure of XML without the parsing overhead.

Content Collection

BLOB_ID CHSBCE unicode single byte sequence
TYPEID(1) Type Identifier =1 Int
BLOCK1(10) Block Identifier value = 10 Int
  id UTFString
  resourceType UTFString
  access UTFString
  hidden boolean
BLOCK2(11) Block Identifier value = 11 Int
  releaseDate Long
  retractDate Long
BLOCK3(12) Block Identifier value = 12 Int
  number of Groups Int
  array of groups array of UTFString
BLOCK4(13) Block Identifier value = 13 Int
  Properties Block
BLOCK_END(2) End marker value = 2 Int

Content Resource

BLOB_ID CHSBRE unicode single byte sequence
TYPEID(1) Type Identifier =1 Int
BLOCK1(10) Block Identifier value = 10 Int
  id UTFString
  resourceType UTFString
  access UTFString
  hidden boolean
BLOCK2(11) Block Identifier value = 11 Int
  releaseDate Long
  retractDate Long
BLOCK3(12) Block Identifier value = 12 Int
  number of elements in following array Int
  array of groups array of UTFString
BLOCK4(13) Block Identifier value = 13 Int
  Properties Block Nested Block
BLOCK5(14) Block Identifier value = 14 Int
  contentType UTFString
  contentLength UTFString
  filePath UTFString
BLOCK6(15) Block Identifier value = 15 Int
  length of following byte array Int
  byte array representing body array of byte
BLOCK_END(2) End marker value = 2 Int

ResourceProperties

BLOB_ID N/A Properties dont exist except as a sub block of an entity
TYPEID(1) Type Identifier =1 Int
BLOCK1(100) Block Identifier value = 100 Int
  Number of property blocks (BLOCK2 & BLOCK3) to follow Int
BLOCK2(101) Block Identifier value = 101 Int
  Name of the Property UTFString
  Value of the Property UTFString
BLOCK3(102) Block Identifier value = 102 Int
  Name of the Property UTFString
  Size of the following Array Int
  Array of Values array of UTFString

Storage

Data Output streams producer byte[]. We did try to store this in a char[] by unsigning the byte[], but this proves problematic as it relies on a database connection that does not corrupt the chars in the range 128-255. (ie pure UTF8). There are some production system in existance that are not using UTF8. Hence we have switch to placing the byte[] into a new BLOB, BINARY_ENTITY.

Performance.

Performance is the only reason we are doing this. If we look at the overhead of parsing and serializing, removing the cost of storing the input and output data.

Technique Memory Overhead per entity CPU Time per Entity
XML DOM Parse 389K 3.5ms
XML DOM Serialize 79K 3.1ms
SAX Parse 48K 1.2ms
Binary Parse 23 bytes 16us
Binary Serialize 26 bytes 18us

The memory consumption on these tests was calculated using Runtime.freeMemory() before and after parsing/serializing 16000 entities. It looks like the JVM reuses most of the code and memory footprint in DataOutputStream and DataInputStream so its memory requirement is shared over all 16000 entities, hence the low overhead.

Inside Sakai, reporting the cost of parsing and serializing entity too and from memory in the BaseDbDoubleStorage, BaseDbSingleStorage (and associated) classes and averaging over 100 operations we see the real cost to Sakai.

Technique Memory Overhead per entity CPU Time per Entity
XML DOM Parse 402K 3.5ms
XML DOM Serialize 82K 3.1ms
SAX Parse 52K 1.2ms
Binary Parse 4K bytes 62us
Binary Serialize 5K bytes 65us

When doing these test there was variation of about +-10% on both the memory and CPU time.

Testing and Conversion

The new BLOB in the entity storage field is used for storing the binary entity. This new blob is called BINARY_ENTITY. When Content hosting start, it checks for this column and the xml column and acts acordingly. If it finds only the XML column, then it uses the default SAX/DOM BaseDbSingleStorage. If it find both the XML column and the BINARY_ENTITY column, it uses a new StorageUser class BaseDbDualSingleStorage, that reads from whichever column is not null and writes to the BINARY_ENTITY column. On writing to the BINARY_ENTITY column it sets the XML column to NULL in the same update, which indicates that the data has been migrated. Once all the records for XML are NULL, that column can be removed from the table in question. If it finds only the BINARY_ENTITY column it uses the BaseDbBinarySingleStorage StorageUser class.

The behaviour of BaseDbDualSingleStorage can be modified by the migrateData property. If true (default) it will write to the BINARY_ENTITY column. If false it will check that there are no non null entries in the BINARY_ENTRY column, and write to the XML column, disabling data migration. When the migrateData property is false, it will refuse to startup if there are any non null entries in the BINARY_ENTITY column to avoid loosing data.

There are a battery of tests within Content that verify that the serializations work and do not loose information. In addition there is a conversion utility. The implementation in content hosting can read and write both XML and Binary and is configured by default to write Binary content to the Database.

Offline Conversion.

There is an offline conversion utility that enables the conversion of the database at the command line on a separate machine. This utility is designed to run with the sakai appservers live and does not require them to be offline. It assumes that the BINARY_ENTITY columns are already present in the schema. When it runs it builds table containing items that have not been converted, and works through that list in batches of 100, checkpointing after each 100 conversions. It might be possible to run this utility on a number of machines in parallel, but this has not been tested and is probably not recommended.

To run, you need to configure a configuration file with the database connection details, and you may need to edit the SQL in that file to suit your target database. This file, a copy of which can be found at https://source.sakaiproject.org/svn/content/trunk/content-impl/impl/src/java/org/sakaiproject/content/impl/serialize/impl/conversion/upgradeschema.config
(the default) and here

# UpgradeSchema Control File (Default)
# Conversion needs the database connection details
dbDriver=com.mysql.jdbc.Driver
dbURL=jdbc:mysql://127.0.0.1:3306/sakai22?useUnicode=true&characterEncoding=UTF-8
dbUser=sakai22
dbPass=sakai22

convert.0=FileSizeResourcesConversion
convert.0.handler.class=org.sakaiproject.content.impl.serialize.impl.conversion.FileSizeResourcesConversionHandler
convert.0.create.migrate.table=create table content_res_fsregister ( id varchar(1024), status varchar(99) )
convert.0.drop.migrate.table=drop table content_res_fsregister
convert.0.check.migrate.table=select count(*) from content_res_fsregister where status <> 'done'
convert.0.select.next.batch=select id from content_res_fsregister where status = 'pending' limit 100
convert.0.complete.next.batch=update content_res_fsregister set status = 'done' where id = ?
convert.0.mark.next.batch=update content_res_fsregister set status = 'locked' where id = ?
convert.0.populate.migrate.table=insert into content_res_fsregister (id,status) select RESOURCE_ID, 'pending' from CONTENT_RESOURCE
convert.0.select.record=select XML from CONTENT_RESOURCE where RESOURCE_ID = ?
convert.0.select.validate.record=select XML from CONTENT_RESOURCE where RESOURCE_ID = ?
convert.0.update.record=update CONTENT_RESOURCE set CONTEXT = ?, FILE_SIZE = ? where RESOURCE_ID = ? 

convert.1=Type1BlobCollectionConversion
convert.1.handler.class=org.sakaiproject.content.impl.serialize.impl.conversion.Type1BlobCollectionConversionHandler
convert.1.create.migrate.table=create table content_col_t1register ( id varchar(1024), status varchar(99) )
convert.1.drop.migrate.table=drop table content_col_t1register
convert.1.check.migrate.table=select count(*) from content_col_t1register  where status <> 'done'
convert.1.select.next.batch=select id from content_col_t1register where status = 'pending' limit 100
convert.1.complete.next.batch=update content_col_t1register set status = 'done' where id = ?
convert.1.mark.next.batch=update content_col_t1register set status = 'locked' where id = ?
convert.1.populate.migrate.table=insert into content_col_t1register (id,status) select COLLECTION_ID, 'pending' from CONTENT_COLLECTION where BINARY_ENTITY IS NULL
convert.1.select.record=select XML from CONTENT_COLLECTION where COLLECTION_ID = ?
convert.1.select.validate.record=select BINARY_ENTITY from CONTENT_COLLECTION where COLLECTION_ID = ?
convert.1.update.record=update CONTENT_COLLECTION set XML = NULL, BINARY_ENTITY = ?  where COLLECTION_ID = ?

convert.2=Type1BlobResourceConversion
convert.2.handler.class=org.sakaiproject.content.impl.serialize.impl.conversion.Type1BlobResourcesConversionHandler
convert.2.create.migrate.table=create table content_res_t1register ( id varchar(1024), status varchar(99) )
convert.2.drop.migrate.table=drop table content_res_t1register
convert.2.check.migrate.table=select count(*) from content_res_t1register  where status <> 'done'
convert.2.select.next.batch=select id from content_res_t1register where status = 'pending' limit 100
convert.2.complete.next.batch=update content_res_t1register set status = 'done' where id = ?
convert.2.mark.next.batch=update content_res_t1register set status = 'locked' where id = ?
convert.2.populate.migrate.table=insert into content_res_t1register (id,status) select RESOURCE_ID, 'pending' from CONTENT_RESOURCE where BINARY_ENTITY is NULL
convert.2.select.record=select XML from CONTENT_RESOURCE where RESOURCE_ID = ?
convert.2.select.validate.record=select BINARY_ENTITY from CONTENT_RESOURCE where RESOURCE_ID = ?
convert.2.update.record=update CONTENT_RESOURCE set CONTEXT = ?, FILE_SIZE = ?, XML = NULL, BINARY_ENTITY = ? where RESOURCE_ID = ?

There is a command line runner, currently as a bash script that runs the conversion utility. It assumes that you maven repo contains a M2 build of sakai at ~/.m2/repository

To run

cd SAKAI_SOURCE/content
# Run maven to check that we are uptodate
mvn -o clean install
runconversion.sh myconversion.config

The script is in SVN but at the time of writing is

#!/bin/sh
CLASSPATH="$CLASSPATH:$HOME/.m2/repository/commons-logging/commons-logging/1.0.4/commons-logging-1.0.4.jar"
CLASSPATH="$CLASSPATH:$HOME/.m2/repository/commons-dbcp/commons-dbcp/1.2.2/commons-dbcp-1.2.2.jar"
CLASSPATH="$CLASSPATH:$HOME/.m2/repository/commons-pool/commons-pool/1.3/commons-pool-1.3.jar"
CLASSPATH="$CLASSPATH:$HOME/.m2/repository/mysql/mysql-connector-java/3.1.11/mysql-connector-java-3.1.11.jar"
CLASSPATH="$CLASSPATH:$HOME/.m2/repository/org/sakaiproject/sakai-util-api/M2/sakai-util-api-M2.jar"
CLASSPATH="$CLASSPATH:$HOME/.m2/repository/org/sakaiproject/sakai-entity-api/M2/sakai-entity-api-M2.jar"
CLASSPATH="$CLASSPATH:$HOME/.m2/repository/org/sakaiproject/sakai-entity-util/M2/sakai-entity-util-M2.jar"
CLASSPATH="$CLASSPATH:$HOME/.m2/repository/org/sakaiproject/sakai-content-api/M2/sakai-content-api-M2.jar"
CLASSPATH="$CLASSPATH:$HOME/.m2/repository/org/sakaiproject/sakai-content-impl/M2/sakai-content-impl-M2.jar"
CLASSPATH="$CLASSPATH:$HOME/.m2/repository/org/sakaiproject/sakai-util-log/M2/sakai-util-log-M2.jar"
CLASSPATH="$CLASSPATH:$HOME/.m2/repository/log4j/log4j/1.2.9/log4j-1.2.9.jar"

java $JAVA_OPTS  \
      -classpath "$CLASSPATH" \
      org.sakaiproject.content.impl.serialize.impl.conversion.UpgradeSchema "$@"

This conversion utility starts a very small part of the sakai framework to do its work.

Enter labels to add to this page:
Please wait 
Looking for a label? Just start typing.