I'm starting to collect my thoughts on the subject of virtual world interoperability system standards into this document, because the subject is becoming too big for me to keep in my head all at one time.
The purpose of this document is to describe how the interoperability story at http://www.interopworld.com/members/node/22 can be implemented in practice between heterogeneous virtual world providers.
A persistent virtual world system is quite complex, involving both client and server software and hardware. It takes all the trappings of a networked multiplayer multimedia computer game, and adds persistence, communication, the ability for users to change the environment, and any number of services (collaboration, interaction, etc). There are several different technologies on the market that make different trade-offs between values such as simulation accuracy, security, freedom, quality, cost, etc. The details of how the persistent world is simulated and updated on the participating client machines vary between implementations. However, all of these platforms want to provide low-latency, real-time interaction between many participants. This real-time, 3D, interactive constraint makes virtual world technology very different from the 2D "text driven" web that came before it. Thus, different kinds of standards are necessary to make interoperability work.
For interoperability to work in such a world, the servers of all participating providers need to agree on what the shared world looks like, and each must provide a view of that shared world to the clients connecting to the same space. Because different servers have different capabilities, it makes most sense for each server to simulate the entities that it introduces, and to provide the other servers with information about how each entity affects the shared world. This way, all the servers simulate an overlapping area, and introduce their own entities into that shared area. Meanwhile, all of the entities in the area can be visualized to each client through the exchange of entity effects, which include basics such as position, looks and animation.
The main flow is as follows:
An alternate approach would be to create a "virtual world browser," which is a universal virtual world client, just like a web browser is a universal HTML client. Unfortunately, that's a terrible idea, for several reasons:
Instead, hooking the servers together on the back end looks a lot more attractive:
Thus, this proposal will get to the interoperable, open "3D web" a lot quicker, with more diversity, fewer bugs, and at much lower cost than the proposal of a universal virtual world client.
The virtual world service hosting the session is known as the "master." The other virtual world services are known as "slaves." This relation is only intended to convey who initiates the session, and who connects to the session; both masters and slaves can introduce entities into the simulation, where all participants can see those entities.
There is one layer of indirection between the locator URL and the actual connection used for virtual world data. This allows systems to implement some amount of load balancing, and de-couples the web service for managing sessions from the provisioning of the actual session resources.
At some point, communication needs to step down from XML to something more compact and real-time. That point is probably right after the gateway has responded with protocol information. The switch can use the HTTP Upgrade/101 mechanism together with Connection: keep-alive. Requests should be POST, so that web caches don't interfere.
Initiating a session might work something like:
Greeting sent from slave to master session service
<?xml version="1.0"?>
<greetings version="1.0" compatible="1.0">
<!-- this is the ID that was separately exchanged for the session, where "session"
identifies where, when, who, etc (like a "meeting id") -->
<sessionid>123456</sessionid>
<!-- I expect to stay for half an hour -->
<duration format="seconds">1800</duration>
</greetings>
The session ID is part of the locator URL. The duration is a hint that the connecting host can give the master host -- it's not clear that there's a good way of coming up with this value, so it may not be necessary.
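As a rough illustration, the greeting body could be assembled like this. This is only a sketch: the function name `build_greeting` is hypothetical, and treating the duration as optional follows the remark above that the hint may not be necessary.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def build_greeting(session_id: str, duration_seconds: Optional[int] = None) -> bytes:
    """Build the XML body of a greeting (hypothetical helper, not part of the proposal)."""
    root = ET.Element("greetings", {"version": "1.0", "compatible": "1.0"})
    ET.SubElement(root, "sessionid").text = session_id
    # The duration is only a hint, so it is left out when the slave has no estimate.
    if duration_seconds is not None:
        duration = ET.SubElement(root, "duration", {"format": "seconds"})
        duration.text = str(duration_seconds)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    print(build_greeting("123456", 1800).decode("utf-8"))
```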
"Playbox" is the physical area of the simulation, in some coordinate system. For example, in WGS-84, it may be a longitude, a latitude, and some measurements of a bounding box. The master will not accept or forward updates for entities that go outside this playbox.
"Terrain" is the static (non-entity) geometry of the simulation. Typically, this will include the ground, buildings, trees, etc.
"Gateways" are hosts that can provide actual simulation data exchange. There may be one or more gateways for the same session, where a slave can choose an arbitrary gateway. If there's a preference, the first gateway in the response should be preferred; this allows the master to do simple round-robin load balancing, while allowing clients to re-establish a session connection to "the next" gateway if the first gateway in the response fails for some reason.
The master provides some credentials that will allow the slave systems to authenticate with the gateway, using some HMAC scheme. An alternative would be to use SSL for all communications, but that would not allow for UDP transport.
Here's a typical response from the web service to the slave:
<?xml version="1.0"?>
<connection version="1.0" compatible="1.0">
<sessionid>123456</sessionid>
<playbox>
<coordsystem uri="canonical-uri">WGS84</coordsystem>
<minimum>
<longitude>-123.0</longitude>
<latitude>37.0</latitude>
<height>-10</height>
</minimum>
<maximum>
<longitude>-122.0</longitude>
<latitude>38.0</latitude>
<height>1010</height>
</maximum>
</playbox>
<duration>
<starttime format="isodate">2008-05-18T10:00:00-08:00</starttime>
<endtime format="isodate">2008-05-18T13:30:00-08:00</endtime>
</duration>
<terrain>
<geometry>
<!-- this is roughly the geocentric center of the playbox -->
<center format="Y-up">-4262200,2352400,2715300</center>
<uri>some-uri</uri>
</geometry>
</terrain>
<gateway>
<uri>some-uri</uri>
<credentials method="hash-auth">
<slaveid>9876</slaveid>
<nonce>12354567</nonce>
<!-- hash of slaveid and sessionid with master-secret key -->
<cookie>abcd</cookie>
</credentials>
</gateway>
</connection>
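A slave might read such a response roughly as follows. This is a minimal sketch that assumes the element names shown in the sample above; the helper names (`parse_connection`, `in_playbox`) are made up for illustration, and the embedded sample is trimmed to the parts the code uses.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample response; only the parts read below are kept.
SAMPLE = """<connection version="1.0" compatible="1.0">
  <sessionid>123456</sessionid>
  <playbox>
    <coordsystem uri="canonical-uri">WGS84</coordsystem>
    <minimum><longitude>-123.0</longitude><latitude>37.0</latitude><height>-10</height></minimum>
    <maximum><longitude>-122.0</longitude><latitude>38.0</latitude><height>1010</height></maximum>
  </playbox>
  <gateway><uri>some-uri</uri></gateway>
</connection>"""


def corner(node):
    """Read one playbox corner as a (longitude, latitude, height) tuple."""
    return tuple(float(node.findtext(tag)) for tag in ("longitude", "latitude", "height"))


def parse_connection(xml_text):
    root = ET.fromstring(xml_text)
    box = root.find("playbox")
    return {
        "sessionid": root.findtext("sessionid"),
        "minimum": corner(box.find("minimum")),
        "maximum": corner(box.find("maximum")),
        # Gateways keep document order, so the first entry is the preferred one.
        "gateways": [g.findtext("uri") for g in root.findall("gateway")],
    }


def in_playbox(info, longitude, latitude, height):
    """True if the point lies inside the playbox (bounds inclusive, an assumption)."""
    point = (longitude, latitude, height)
    return all(lo <= p <= hi for lo, p, hi in zip(info["minimum"], point, info["maximum"]))


if __name__ == "__main__":
    info = parse_connection(SAMPLE)
    print(info["gateways"], in_playbox(info, -122.5, 37.5, 0.0))
```

Keeping the gateway list in document order lets the slave fall back to "the next" gateway simply by walking the list.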
Once the slave has retrieved session information from the session service, it will connect to the indicated gateway. The initial connection will use XML, but the HTTP Upgrade/101 mechanism is used to switch to a binary (less verbose, lower latency) format. It is possible to introduce a UDP connection at this point, but for version 1, it's probably simpler to stay on TCP and live with the higher latency that involves. Additionally, if this is done with Connection: keep-alive, there is some chance that Web proxies will actually let these requests through, which might be a useful way to get through restrictive firewalls.
The request looks something like:
Upgrade: entity-stream/1.0
<?xml version="1.0"?>
<connect version="1.0" compatible="1.0">
<sessionid>123456</sessionid>
<credentials method="hash-auth">
<slaveid>9876</slaveid>
<cookie>abcd</cookie>
<!-- hash of slaveid, sessionid and nonce with session-specific
password (separately exchanged) -->
<hash>cdcdcdcd</hash>
</credentials>
</connect>
After the 101 status is returned, the session will immediately switch to binary format.
It might be beneficial to provide for re-authentication on the entity
connection every so often. This should use the nonce provided by the
original session, the slave id, the separate password and some challenge
provided by the gateway to freshen the credentials.
Should the master gateway also be authenticated to the slave? Or
should everything just use SSL instead?
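The two credential hashes mentioned in the messages above (the master's cookie and the slave's connect hash) could be computed along these lines. This is a sketch only: the proposal says "some HMAC scheme" without fixing one, so the choice of HMAC-SHA256, the colon-separated field layout, and the function names are all assumptions.

```python
import hashlib
import hmac


def make_cookie(master_secret: bytes, slave_id: str, session_id: str) -> str:
    """Cookie issued by the master: keyed hash of slaveid and sessionid.

    The field order and ':' separator are assumptions, not part of the proposal.
    """
    message = f"{slave_id}:{session_id}".encode("utf-8")
    return hmac.new(master_secret, message, hashlib.sha256).hexdigest()


def make_connect_hash(session_password: bytes, slave_id: str,
                      session_id: str, nonce: str) -> str:
    """Hash sent by the slave: slaveid, sessionid and nonce, keyed with the
    separately exchanged session password."""
    message = f"{slave_id}:{session_id}:{nonce}".encode("utf-8")
    return hmac.new(session_password, message, hashlib.sha256).hexdigest()


def verify(expected: str, received: str) -> bool:
    """Constant-time comparison, so timing does not leak how much matched."""
    return hmac.compare_digest(expected, received)


if __name__ == "__main__":
    cookie = make_cookie(b"master-secret", "9876", "123456")
    proof = make_connect_hash(b"session-password", "9876", "123456", "12354567")
    print(cookie, proof)
```

Because the cookie is keyed with a secret only the master side knows, a gateway can recompute it from the slave id and session id without keeping per-slave state.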
FRAMING (VERB SIZE DATA)+
FRAMING:
TOKEN
SIZE (including verbs)
SIZE (header size)
PACKETID
FLAGS
GLOBALTIME
NACKS (PACKETID)+
Integers are sent in a variable-length encoding: each byte carries 7
bits of data, and the 8th bit marks continuation. Big-endian order. If
the highest defined bit is set, the value is negative, except in the
1-byte case.
Floats are sent as 32-bit or 64-bit IEEE floats, or as fixed format ints
(based on schema). Big-endian order.
Strings are sent as byte count (int as above) + UTF-8 data as byte stream.
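The integer and string rules might look like this in code. This sketch makes two assumptions that the text does not settle: a set 8th bit means "more bytes follow", and only non-negative integers are handled, because the sign rule (especially the 1-byte exception) is not specified precisely enough to implement here.

```python
def encode_uint(value: int) -> bytes:
    """Encode a non-negative int, 7 data bits per byte, most significant group first."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()  # big-endian: most significant 7-bit group goes first
    # Assumption: the 8th bit is set on every byte except the last one.
    out = bytearray(group | 0x80 for group in groups[:-1])
    out.append(groups[-1])
    return bytes(out)


def decode_uint(data: bytes, offset: int = 0):
    """Return (value, next_offset) for the integer starting at offset."""
    value = 0
    while True:
        byte = data[offset]
        offset += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, offset


def encode_string(text: str) -> bytes:
    """Byte count (as a variable-length int) followed by the UTF-8 bytes."""
    raw = text.encode("utf-8")
    return encode_uint(len(raw)) + raw


def decode_string(data: bytes, offset: int = 0):
    """Return (text, next_offset) for the string starting at offset."""
    length, offset = decode_uint(data, offset)
    end = offset + length
    return data[offset:end].decode("utf-8"), end


if __name__ == "__main__":
    print(encode_uint(300).hex())        # two bytes: 0x82, 0x2c
    print(decode_string(encode_string("héllo")))
```

Note that the length prefix counts bytes, not characters, so "héllo" is prefixed with 6 rather than 5.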
VERB:
ADDENTITYTYPE TYPEID SCHEMAURI
SUBSCRIBETYPE TYPEID NVALUEIDS (VALUEID)+
UNSUBSCRIBETYPE TYPEID
ADDENTITY TYPEID ENTITYID NVALUES (VALUE)+
REMOVEENTITY ENTITYID
UPDATEENTITY ENTITYID NVALUES (VALUE)+
VALUE:
VALUEID DATA
I looked around for a suitable binary protocol specification/standard method, but couldn't find anything good. Most protocols either just specify the bit fields as words (a la IP headers etc), or start wrapping data in too much gunk. The point is to transmit a minimum of data for each runtime update, but to allow for a rich set of properties on entities. Because entities will send property updates as "property id" plus "value," and the id is encoded as a variable-length int, it's useful to give the lowest property ids to the most-frequently changing properties.
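To make the point about low property ids concrete, here is a small calculation of how many bytes an id costs on the wire. It assumes the unsigned reading of the integer encoding (7 data bits per byte, no bit reserved for sign); if a sign bit is reserved, the one-byte range is smaller, but the conclusion is the same. The function name is hypothetical.

```python
def encoded_id_length(value_id: int) -> int:
    """Bytes needed for a non-negative id at 7 data bits per byte."""
    if value_id < 0:
        raise ValueError("ids are assumed to be non-negative")
    # bit_length() is 0 for the value 0, which still occupies one byte.
    return max(1, (value_id.bit_length() + 6) // 7)


if __name__ == "__main__":
    for value_id in (1, 127, 128, 16383, 16384):
        print(value_id, encoded_id_length(value_id))
```

Under this assumption, a position property with id 1 costs one byte of overhead per update, while a property with id 128 or higher costs at least two.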
In general, the receiving end will examine each schema that gets introduced, and decide to subscribe to some subset of the properties defined in that schema. Then, when an entity appears that is an instance of a schema the receiver is subscribed to, the sending end will first introduce that entity, and then keep the receiver up to date with changes in the subscribed property values.
Most current binary marshaling methods require either significant additional metadata with each marshaled request, or require that each participant be updated with new binary data each time the schema of any one entity changes. Neither of those is a desirable property in a real-time virtual world entity protocol. To solve this problem, we require that an entity does not change its schema for as long as it is instantiated in a simulation session. We can then send the entity schema once, before the entity data, and then use that schema as a key to how to decode the data. If an entity wants to undergo a "live" schema update (which in the end will be inevitable, because these systems will have to stay up 24/7), the entity itself can be removed from the session and then re-introduced with a new schema or schema version.
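The receiving side of that rule can be sketched as a small registry: the schema is recorded once per type, each entity is bound to a type for its lifetime, and updates are interpreted through that binding. The class and method names here are hypothetical, and property values are shown as already-decoded Python values rather than wire bytes.

```python
class EntityRegistry:
    """Tracks which schema each live entity uses (illustrative sketch only)."""

    def __init__(self):
        self.types = {}     # typeid -> {valueid: property name}
        self.entities = {}  # entityid -> typeid, fixed while the entity exists

    def add_entity_type(self, typeid, properties):
        self.types[typeid] = dict(properties)

    def add_entity(self, entityid, typeid):
        if typeid not in self.types:
            raise KeyError(f"unknown entity type {typeid}")
        if entityid in self.entities:
            raise ValueError("an entity cannot change schema while it is live")
        self.entities[entityid] = typeid

    def remove_entity(self, entityid):
        self.entities.pop(entityid, None)

    def name_update(self, entityid, values):
        """Map (valueid, data) pairs to property names using the entity's schema."""
        schema = self.types[self.entities[entityid]]
        return {schema[valueid]: data for valueid, data in values}


if __name__ == "__main__":
    registry = EntityRegistry()
    registry.add_entity_type(1, {1: "name", 2: "position"})
    registry.add_entity(42, 1)
    print(registry.name_update(42, [(2, (1.0, 2.0, 3.0))]))
    # A "live" schema update: remove the entity, then re-introduce it under a new type.
    registry.add_entity_type(2, {1: "position", 2: "name", 3: "velocity"})
    registry.remove_entity(42)
    registry.add_entity(42, 2)
    print(registry.name_update(42, [(1, (1.0, 2.0, 3.0))]))
```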
The schema for an entity describes the properties and protocols of the entity. A schema may contain optional or variant subsections. It may implement a number of interfaces (which map to well-defined properties), as well as custom extra properties. The property nids (numerical ids) are not defined in the interface specification; they are specific to the entity schema in question. The version number applies to this provider's schema, not to any "canonical" version of the schema.
<?xml version="1.0"?>
<vwipschema target="entity" version="1.0" compatible="1.0">
<!-- canonical URI name for the interface -->
<interface restriction="optional">uri</interface>
<interface restriction="required">uri</interface>
<required>
<property nid="1">
<semantic>labelname</semantic>
<name>name</name>
<type>string</type>
</property>
<switch>
<property nid="2">
<semantic>staticmesh</semantic>
<name>mesh</name>
<type href="uri">mesh</type>
</property>
<required>
<property nid="3">
<semantic>animatedmesh</semantic>
<name>mesh</name>
<type href="uri">mesh</type>
</property>
<property nid="5">
<semantic>idleanimation</semantic>
<name>idle</name>
<type href="uri">animation</type>
</property>
<optional>
<property nid="6">
<semantic>walkanimation</semantic>
<name>walk</name>
<type href="uri">animation</type>
</property>
</optional>
</required>
</switch>
<optional>
<property nid="4">
<semantic>idlesound</semantic>
<name>breathingsound</name>
<type href="uri">sound</type>
</property>
</optional>
</required>
</vwipschema>
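A receiver could turn such a schema into a lookup table from nid to property description, which is what it needs in order to decide what to subscribe to and how to decode values. This is a sketch under the assumption that the element names match the sample above; it flattens the required/optional/switch nesting and ignores interfaces, and `property_table` is a hypothetical name. The embedded schema is a shortened copy of the sample.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample schema, kept small for the example.
SCHEMA = """<vwipschema target="entity" version="1.0" compatible="1.0">
  <required>
    <property nid="1">
      <semantic>labelname</semantic><name>name</name><type>string</type>
    </property>
    <optional>
      <property nid="4">
        <semantic>idlesound</semantic><name>breathingsound</name>
        <type href="uri">sound</type>
      </property>
    </optional>
  </required>
</vwipschema>"""


def property_table(xml_text):
    """Return {nid: (semantic, name, type)} for every property in the schema."""
    root = ET.fromstring(xml_text)
    table = {}
    # iter() walks the whole tree, so nesting under required/optional/switch is flattened.
    for prop in root.iter("property"):
        nid = int(prop.get("nid"))
        if nid in table:
            raise ValueError(f"duplicate nid {nid}")
        table[nid] = (prop.findtext("semantic"), prop.findtext("name"), prop.findtext("type"))
    return table


if __name__ == "__main__":
    for nid, description in sorted(property_table(SCHEMA).items()):
        print(nid, description)
```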
Mesh data and animation data need to be in some known format. Perhaps COLLADA or a low-overhead X3D profile can be used. The geometry per animated/simulated/moving entity won't be too complex to send as a single chunk (as opposed to terrain, which could conceivably be "the entire Earth"). For development purposes, I'm proposing BLAT as XML, transferred as bzip-compressed text. It's possible for the receiving end to translate from the interchange format to a runtime-optimized format.
Interactions are "verbs," whereas entities are "nouns." This is not entirely true, because movement is a property of the entity (position and velocity), not an interaction, but it is close enough.
VERB:
ADDINTERACTION INTERACTIONID SCHEMAURI
SUBSCRIBEINTERACTION INTERACTIONID NVALUEIDS (VALUEID)+
UNSUBSCRIBEINTERACTION INTERACTIONID
INTERACT INTERACTIONID NVALUES (VALUE)+
Interaction schema looks like entity schema, but with the target "interaction" instead of "entity".
It would be useful if, instead of "ADDINTERACTION" and "ADDENTITYTYPE," there could be an "ADDENTITYSCHEMA" and an "ADDINTERACTIONSCHEMA" which defined the set of entities and interactions that could happen in the given host. The other end could then compare that schema to something it already has, and wouldn't have to transfer all the capabilities each time it connected. However, keeping that out of 1.0 makes it simpler to get something done sooner.