Virtual World System Interoperability Standard

I'm starting to collect my thoughts on the subject of virtual world interoperability system standards into this document, because the subject is becoming too big for me to keep in my head all at one time.

The purpose of this document is to describe how the interoperability story at http://www.interopworld.com/members/node/22 can be implemented in practice between heterogeneous virtual world providers.

A persistent virtual world system is quite complex, involving both client and server software and hardware. It takes all the trappings of a networked multiplayer multimedia computer game, and adds persistence, communication, the ability for users to change the environment, and any number of services (collaboration, interaction, etc.). There are several different technologies on the market that make different trade-offs between values such as simulation accuracy, security, freedom, quality, and cost. The details of how the persistent world is simulated and updated on the participating client machines vary between implementations. However, all of these platforms want to provide low-latency, real-time interaction between many participants. This real-time, 3D, interactive constraint makes virtual world technology very different from the 2D "text driven" web that came before it. Thus, different kinds of standards are necessary to make interoperability work.

For interoperability to work in such a world, the servers of all the system providers need to agree on what the shared world looks like, and each needs to provide a view of that shared world to the clients connecting to the same space. Because different servers have different capabilities, it makes the most sense for each server to simulate the entities that it introduces, and to provide the other servers with information about how each entity affects the shared world. This way, all the servers simulate an overlapping area, and introduce their own entities into that shared area. Meanwhile, all of the entities in the area can be visualized on each client through the exchange of entity effects, which include basics such as position, looks and animation.

The main flow is as follows:

  1. User 1 decides to host an invited session. This could be a one-hour conference, or a multi-year "permanent" exhibition.
  2. User 1 describes the hosted session to "his" virtual world provider, who provisions for the session. This might be as simple as opening your island in Second Life, or involve renting server capacity by the hour using micropayments, or something else that is up to the provider (not the standard).
  3. User 1 provides information about the session to other users. This information includes a locator URL and some credentials (such as an access code, password, or similar).
  4. Users 2..N provide the locator URL and the credentials to their respective virtual world providers, after which they are "teleported" into the world as hosted by User 1. Each virtual world provider retains simulation responsibility for its users, within a copy of the given world.
  5. Each virtual world provider hosts its own users, and the objects created by those users. The different hosts exchange telemetry data with the hosting "master" server, which updates all participants.
  6. Any "non-native" object in a given virtual world provider server instance will be represented using some set of mesh, texture, animation and sound data; the rich model of that object will only execute on the host system for that object.

Model Justification

An alternate approach would be to create a "virtual world browser," which is a universal virtual world client, just like a web browser is a universal HTML client. Unfortunately, that's a terrible idea, for several reasons:

  1. Virtual worlds are about interacting with lots of people. That's very different from the model of a web browser, and thus requires a different approach.
  2. Virtual worlds are about real-time, interactive, low-latency 3D. This requires a very different approach than the current "web 2.0" web services.
  3. The amount of technology that goes into doing a 3D world is two orders of magnitude greater (or more) than what goes into doing a 2D web page. If there were a universal client, it would either have to re-implement everything that all the current providers already have to some arbitrary new standard, or it would have to standardize on one incumbent provider technology. Neither is likely to be popular among all the other virtual world providers.
  4. With universal clients, you don't actually solve the "what does a unified world look like" problem. Each client would have to disconnect from one system, and connect to another system, for each new "place" that was visited.
  5. It would make innovation very hard, because you couldn't really get enough support on the client and server side to reach critical mass. As an example, it took 15 years before the majority of (and not all!) web browsers supported client-side XSL with any consistency. Virtual worlds have orders of magnitude more complexity in them than XSL, so I hope we can find a better model.
  6. Compare to 3D online games, where everybody has to use the same version of the client software for the game to work. What if virtual world site A *required* client version 1.32, but site B *required* 1.29?

Instead, hooking the servers together on the back end looks a lot more attractive:

  1. It's a problem a virtual world provider has to solve anyway, so it might as well be solved in a standard way.
  2. It allows the well-tuned server/client infrastructure that already exists to be re-used, providing a much better user experience.
  3. It allows a virtual world hosting provider to innovate, because the client and server can be kept in sync using whatever patching system that provider likes.
  4. The user has a relationship with a VWSP (Virtual World Service Provider), similar to the relationship he or she has with an ISP or an e-mail provider. It's up to that single VWSP to provide a high-quality experience. That's a lot easier to manage than getting arbitrary levels of quality from arbitrary point-to-point connections on the web.
  5. The incremental engineering effort to get there from here is small, with no danger of favoring one platform over another, or requiring re-implementation of technology that's taken 10 years to build.

Thus, this proposal will get to the interoperable, open "3D web" a lot quicker, with more diversity, fewer bugs, and at a much lower cost than the proposal of a universal virtual world client.

Initial Connection Negotiation

The virtual world service hosting the session is known as the "master." The other virtual world services are known as "slaves." This relation is only intended to convey who initiates the session, and who connects to the session; both masters and slaves can introduce entities into the simulation, where all participants can see those entities.

There is one layer of indirection between the locator URL and the actual connection used for virtual world data. This allows systems to implement some amount of load balancing, and de-couples the web service for managing sessions from the provisioning of the actual session resources.

At some point, communication needs to step down from XML to something more compact and real-time. The natural point is probably right after the gateway has responded with protocol information. This can use the HTTP Upgrade/101 mechanism together with Connection: keep-alive. Requests need to be POST so that web caches don't interfere.

Initiating a session might work something like:

  1. Slave makes HTTP POST request to start session.
    1. Request contains session identifier.
    2. Request contains credentials.
      1. This should probably be done per client that wants to connect.
      2. Although the master can't really prevent leeching if allowed on the slave.
      3. Thus, strong DRM shouldn't be part of this spec.

Greeting sent from slave to master session service

  <?xml version="1.0"?>
  <greetings version="1.0" compatible="1.0">
    <!-- this is the ID that was separately exchanged for the session, where "session"
      identifies where, when, who, etc (like a "meeting id") -->
    <sessionid>123456</sessionid>
    <!-- I expect to stay for half an hour -->
    <duration format="seconds">1800</duration>
  </greetings>

The session ID is part of the locator URL. The duration is a hint that the connecting host can give the master host -- it's not clear that there's a good way of coming up with this value, so it may not be necessary.
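As a rough illustration, the greeting above could be generated like this. The function name `build_greeting` is hypothetical, and treating the duration as optional is an assumption based on the note that it is only a hint; how the body is POSTed is left out.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def build_greeting(session_id: str, duration_seconds: Optional[int] = None) -> bytes:
    """Build the <greetings> document a slave POSTs to the session service."""
    root = ET.Element("greetings", {"version": "1.0", "compatible": "1.0"})
    ET.SubElement(root, "sessionid").text = session_id
    if duration_seconds is not None:
        # The duration is only a hint, so it is left out when unknown.
        duration = ET.SubElement(root, "duration", {"format": "seconds"})
        duration.text = str(duration_seconds)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


# Example: the greeting shown above (session 123456, half an hour).
body = build_greeting("123456", 1800)
```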

  1. Master provides connection information
    1. Here is the playbox.
    2. Here are links to the terrain.
    3. Here are links to gateways.
      1. Different gateways for different areas of playbox?
      2. Start out with just one, for simplicity?
    4. Here are credentials for the gateways. (?)
    5. Session may have a defined duration in time.

"Playbox" is the physical area of the simulation, in some coordinate system. For example, in WGS-84, it may be a longitude, a latitude, and some measurements of a bounding box. The master will not accept or forward updates for entities that go outside this playbox.

"Terrain" is the static (non-entity) geometry of the simulation. Typically, this will include the ground, buildings, trees, etc.

"Gateways" are hosts that can provide actual simulation data exchange. There may be one or more gateways for the same session, where a slave can choose an arbitrary gateway. If there's a preference, the first gateway in the response should be preferred; this allows the master to do simple round-robin load balancing, while allowing clients to re-establish a session connection to "the next" gateway if the first gateway in the response fails for some reason.

The master provides some credentials that will allow the slave systems to authenticate with the gateway, using some HMAC scheme. An alternative would be to use SSL for all communications, but that would not allow for UDP transport.

Here's a typical response from the web service to the slave:

  <?xml version="1.0"?>
  <connection version="1.0" compatible="1.0">
    <sessionid>123456</sessionid>
    <playbox>
      <coordsystem uri="canonical-uri">WGS84</coordsystem>
      <minimum>
        <longitude>-123.0</longitude>
        <latitude>37.0</latitude>
        <height>-10</height>
      </minimum>
      <maximum>
        <longitude>-122.0</longitude>
        <latitude>38.0</latitude>
        <height>1010</height>
      </maximum>
    </playbox>

    <duration>
      <starttime format="isodate">2008-05-18T10:00:00-08:00</starttime>
      <endtime format="isodate">2008-05-18T13:30:00-08:00</endtime>
    </duration>
    <terrain>
      <geometry>
        <!-- this is roughly the geocentric center of the playbox -->
        <center format="Y-up">-4262200,2352400,2715300</center>
        <uri>some-uri</uri>
      </geometry>
    </terrain>

    <gateway>
      <uri>some-uri</uri>
      <credentials method="hash-auth">
        <slaveid>9876</slaveid>
        <nonce>12354567</nonce>
        <!-- hash of slaveid and sessionid with master-secret key -->
        <cookie>abcd</cookie>
      </credentials>
    </gateway>
  </connection>
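The playbox rule (the master does not accept or forward updates for entities outside the box) amounts to a bounds check. Here is a minimal sketch using the values from the response above. The `Playbox` class is hypothetical, the bounds are assumed to be inclusive, and the box is assumed not to cross the antimeridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Playbox:
    """Axis-aligned box in longitude/latitude/height (hypothetical helper)."""
    min_lon: float
    min_lat: float
    min_height: float
    max_lon: float
    max_lat: float
    max_height: float

    def contains(self, lon: float, lat: float, height: float) -> bool:
        # Inclusive bounds are an assumption; the text does not say.
        return (self.min_lon <= lon <= self.max_lon
                and self.min_lat <= lat <= self.max_lat
                and self.min_height <= height <= self.max_height)


# Values taken from the <playbox> element in the sample response.
box = Playbox(-123.0, 37.0, -10.0, -122.0, 38.0, 1010.0)
```

A master would call `box.contains(...)` on each incoming position update and drop the update when it returns False.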

Slave to Gateway Connection

Once the slave has retrieved session information from the session service, it will connect to the indicated gateway. The initial connection uses XML, but the HTTP Upgrade/101 mechanism is used to switch to a binary (less verbose, lower latency) format. It is possible to introduce a UDP connection at this point, but for version 1, it's probably simpler to stay with TCP and live with the higher latency that involves. Additionally, if this is done with Connection: keep-alive, there is some chance that web proxies will actually let these requests through, which might be a useful way to get through restrictive firewalls.
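When the response lists several gateways, the "prefer the first, fall back to the next" behavior described earlier might look roughly like the sketch below. `connect_first_available` and the `connect` callable are hypothetical; treating `OSError` as the failure signal is an assumption that matches how Python socket errors are reported.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def connect_first_available(gateway_uris: Iterable[str],
                            connect: Callable[[str], T]) -> T:
    """Try gateways in the order the master listed them.

    Returns the first successful connection; raises ConnectionError
    if every gateway fails.
    """
    failures = []
    for uri in gateway_uris:
        try:
            return connect(uri)
        except OSError as exc:
            # Remember the failure and move on to "the next" gateway.
            failures.append((uri, str(exc)))
    raise ConnectionError(f"no gateway reachable: {failures}")
```

Because the slave always walks the list from the top, the master can rotate the order of the list between responses to get simple round-robin load balancing.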

The request looks something like:

  Upgrade: entity-stream/1.0
  
  <?xml version="1.0"?>
  <connect version="1.0" compatible="1.0">
    <sessionid>123456</sessionid>
    <credentials method="hash-auth">
      <slaveid>9876</slaveid>
      <cookie>abcd</cookie>
      <!-- hash of slaveid, sessionid and nonce with
           session-specific password (separately exchanged) -->
      <hash>cdcdcdcd</hash>
    </credentials>
  </connect>

After the 101 status is returned, the session will immediately switch to binary format.
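The "hash-auth" credentials could be computed along the following lines. This is only a sketch: the proposal says "some HMAC scheme" without fixing details, so HMAC-SHA256, hex output, and the NUL field separator are all assumptions, and the function names are hypothetical.

```python
import hashlib
import hmac


def _mac(key: bytes, *fields: str) -> str:
    # Assumed framing: fields joined by NUL so "12"+"3" differs from "1"+"23".
    message = "\x00".join(fields).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def make_cookie(master_secret: bytes, slave_id: str, session_id: str) -> str:
    """Cookie issued by the master: hash of slaveid and sessionid."""
    return _mac(master_secret, slave_id, session_id)


def make_connect_hash(session_password: bytes, slave_id: str,
                      session_id: str, nonce: str) -> str:
    """Hash sent by the slave: slaveid, sessionid and nonce."""
    return _mac(session_password, slave_id, session_id, nonce)


def gateway_accepts(master_secret: bytes, session_password: bytes,
                    slave_id: str, session_id: str, nonce: str,
                    cookie: str, connect_hash: str) -> bool:
    """Recompute both values on the gateway and compare in constant time."""
    expected_cookie = make_cookie(master_secret, slave_id, session_id)
    expected_hash = make_connect_hash(session_password, slave_id,
                                      session_id, nonce)
    cookie_ok = hmac.compare_digest(cookie.encode("utf-8"),
                                    expected_cookie.encode("utf-8"))
    hash_ok = hmac.compare_digest(connect_hash.encode("utf-8"),
                                  expected_hash.encode("utf-8"))
    return cookie_ok and hash_ok
```

The cookie lets the gateway check that the master issued the slave id for this session without the gateway having to keep per-slave state; the second hash shows that the slave also knows the separately exchanged session password.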

It might be beneficial to provide for re-authentication on the entity
connection every so often. This should use the nonce provided by the
original session, the slave id, the separate password and some challenge
provided by the gateway to freshen the credentials.

Possibly worry about authenticating the master gateway to the slave,
too? Or just use SSL for it all instead?

Binary Entity Stream

  1. Master and Slave can both introduce entities
    1. An entity is introduced with a source-unique ID, as an instance of a schema class.
    2. The schema class has a resolution mechanism if previously unknown.
    3. Entities can come and go while the session is live.
    4. Use template/value bindings in entity description.
    5. Entity class contains references to where to get resources.
      1. If a large amount of choice is available, perhaps only send data values when requested?

Each packet is laid out as follows:

  FRAMING (VERB SIZE DATA)+
  
  FRAMING:
    TOKEN
    SIZE (including verbs)
    SIZE (header size)
    PACKETID
    FLAGS
    GLOBALTIME
    NACKS (PACKETID)+

Integers are sent in a variable-length encoding: each byte carries 7 bits
of data, and the 8th bit signals continuation. Big-endian order. If the
highest defined bit is set, the value is negative, except in the 1-byte case.

Floats are sent as 32-bit or 64-bit IEEE floats, or as fixed format ints
(based on schema). Big-endian order.

Strings are sent as byte count (int as above) + UTF-8 data as byte stream.
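The primitive encodings can be sketched as follows. This sketch treats integers as unsigned and leaves the sign rule out, since the rule as written needs more precise wording; it also assumes that the continuation bit is set on every byte except the last. All function names are hypothetical.

```python
import struct
from typing import Tuple


def encode_uint(value: int) -> bytes:
    """Variable-length int: 7 data bits per byte, most significant group first."""
    if value < 0:
        raise ValueError("this sketch covers non-negative integers only")
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()  # big-endian: highest 7-bit group goes out first
    # The 8th bit marks "more bytes follow" on all but the final byte.
    return bytes(g | 0x80 for g in groups[:-1]) + bytes([groups[-1]])


def decode_uint(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Return (value, offset just past the encoded integer)."""
    value = 0
    while True:
        byte = data[offset]
        offset += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, offset


def encode_float32(value: float) -> bytes:
    return struct.pack(">f", value)  # big-endian IEEE 754 single


def encode_float64(value: float) -> bytes:
    return struct.pack(">d", value)  # big-endian IEEE 754 double


def encode_string(text: str) -> bytes:
    raw = text.encode("utf-8")
    return encode_uint(len(raw)) + raw  # byte count, then UTF-8 bytes


def decode_string(data: bytes, offset: int = 0) -> Tuple[str, int]:
    length, offset = decode_uint(data, offset)
    end = offset + length
    return data[offset:end].decode("utf-8"), end
```

With this layout, any value below 128 fits in a single byte, which is why small ids are cheap to send.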

  VERB:
    ADDENTITYTYPE TYPEID SCHEMAURI
    SUBSCRIBETYPE TYPEID NVALUEIDS (VALUEID)+
    UNSUBSCRIBETYPE TYPEID
    ADDENTITY TYPEID ENTITYID NVALUES (VALUE)+
    REMOVEENTITY ENTITYID
    UPDATEENTITY ENTITYID NVALUES (VALUE)+
  
  VALUE:
    VALUEID DATA

I looked around for a suitable binary protocol specification/standard method, but couldn't find anything good. Most protocols either just specify the bit fields as words (a la IP headers etc), or start wrapping data in too much gunk. The point is to transmit a minimum of data for each runtime update, but to allow for a rich set of properties on entities. Because entities will send property updates as "property id" plus "value," and the id is encoded as a variable-length int, it's useful to give the lowest property ids to the most-frequently changing properties.

In general, the receiving end of the entity will examine each schema that gets introduced, and decide to subscribe to some subset of the properties defined in that schema. Then, when an entity appears that is an instance of a schema that the receiver is subscribed to, the sending end will make sure to first introduce that entity, and then keep the receiver up to date with changes in the property values subscribed to.
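On the sending side, this subscription model might be tracked with a small table like the one below. The class and method names are hypothetical, and it is an assumption that a second SUBSCRIBETYPE for the same type replaces the earlier list rather than adding to it.

```python
from typing import Dict, Iterable, Set


class SubscriptionTable:
    """Tracks which value ids one receiver wants for each entity type."""

    def __init__(self) -> None:
        self._wanted: Dict[int, Set[int]] = {}

    def subscribe(self, type_id: int, value_ids: Iterable[int]) -> None:
        # SUBSCRIBETYPE TYPEID NVALUEIDS (VALUEID)+
        self._wanted[type_id] = set(value_ids)

    def unsubscribe(self, type_id: int) -> None:
        # UNSUBSCRIBETYPE TYPEID
        self._wanted.pop(type_id, None)

    def is_subscribed(self, type_id: int) -> bool:
        return type_id in self._wanted

    def filter_update(self, type_id: int,
                      changed: Dict[int, object]) -> Dict[int, object]:
        """Keep only the changed values this receiver asked for."""
        wanted = self._wanted.get(type_id, set())
        return {vid: val for vid, val in changed.items() if vid in wanted}


table = SubscriptionTable()
table.subscribe(7, [1, 2])
update = table.filter_update(7, {1: "Alice", 3: "mesh-uri"})
```

Here `update` contains only value id 1, so an UPDATEENTITY for that receiver would carry a single value.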

Entity Schema

Most current binary marshaling methods either require significant additional metadata with each marshaled request, or require that each participant be updated with new binary data each time the schema of any one entity changes. Neither of those is a desirable property in a real-time virtual world entity protocol. To solve this problem, we require that an entity does not change its schema during its instantiated lifetime within a simulation session. We can then send the entity schema once, before the entity data, and then use that schema as a key to how to decode the data. If an entity wants to undergo a "live" schema update (which in the end will be inevitable, because these systems will have to stay up 24/7), the entity itself can be removed from the session and then re-introduced with a new schema or schema version.

  1. Master and Slave exchange entity telemetry
    1. Movement.
    2. Animation.
    3. Avatar chat might be described as telemetry.
    4. VoIP as telemetry?

The schema for an entity describes the properties and protocols of the entity. A schema may contain optional or variant subsections. It may implement a number of interfaces (which map to well-defined properties), as well as custom extra properties. The property nids (numerical ids) are not defined in the interface specification; they are specific to the entity schema in question. The version number applies to this provider's schema, not to any "canonical" version of the schema.

  <?xml version="1.0"?>
  <vwipschema target="entity" version="1.0" compatible="1.0">
    ''<!-- canonical URI name for the interface -->''
    <interface restriction="optional">uri</interface>
    <interface restriction="required">uri</interface>
    <required>
      <property nid="1">
        <semantic>labelname</semantic>
        <name>name</name>
        <type>string</type>
      </property>
      <switch>
        <property nid="2">
          <semantic>staticmesh</semantic>
          <name>mesh</name>
          <type href="uri">mesh</type>
        </property>
        <required>
          <property nid="3">
            <semantic>animatedmesh</semantic>
            <name>mesh</name>
            <type href="uri">mesh</type>
          </property>
          <property nid="5">
            <semantic>idleanimation</semantic>
            <name>idle</name>
            <type href="uri">animation</type>
          </property>

          <optional>
            <property nid="6">
              <semantic>walkanimation</semantic>
              <name>walk</name>
              <type href="uri">animation</type>
            </property>
          </optional>
        </required>
      </switch>
      <optional>
        <property nid="4">
          <semantic>idlesound</semantic>
          <name>breathingsound</name>
          <type href="uri">sound</type>
        </property>
      </optional>
    </required>
  </vwipschema>
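A receiver could turn such a schema into a lookup table from nid to meaning, which it then uses to decode VALUEID/DATA pairs. A minimal sketch, using a shortened copy of the schema above; the function name `property_table` is hypothetical, and rejecting duplicate nids is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple

SCHEMA = """<?xml version="1.0"?>
<vwipschema target="entity" version="1.0" compatible="1.0">
  <required>
    <property nid="1">
      <semantic>labelname</semantic>
      <name>name</name>
      <type>string</type>
    </property>
    <optional>
      <property nid="4">
        <semantic>idlesound</semantic>
        <name>breathingsound</name>
        <type href="uri">sound</type>
      </property>
    </optional>
  </required>
</vwipschema>"""


def property_table(schema_xml: str) -> Dict[int, Tuple[str, str]]:
    """Map each property nid to its (semantic, type) pair."""
    root = ET.fromstring(schema_xml)
    table: Dict[int, Tuple[str, str]] = {}
    # iter() walks all nesting levels: required, optional and switch alike.
    for prop in root.iter("property"):
        nid = int(prop.get("nid"))
        if nid in table:
            raise ValueError(f"duplicate nid {nid} in schema")
        table[nid] = (prop.findtext("semantic"), prop.findtext("type"))
    return table


props = property_table(SCHEMA)
```

This flattening ignores the required/optional/switch structure; a fuller implementation would keep it in order to validate incoming entities.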

Mesh data and animation data need to be in some known format. Perhaps COLLADA or a low-overhead X3D profile can be used. The geometry per animated/simulated/moving entity won't be too complex to send as a single chunk (as opposed to terrain, which could conceivably be "the entire Earth"). For development purposes, I'm proposing BLAT as XML, transferred as bzip-compressed text. It's possible for the receiving end to translate from the interchange format to a runtime-optimized format.

Interactions

Interactions are "verbs," whereas entities are "nouns." This is not entirely true, because movement is a property of the entity (position and velocity), not an interaction, but it is close enough.

  1. Master and Slave exchange interactions
    1. Collisions.
    2. Detonations.
    3. Signals.
    4. Global chat might be described as an interaction.

The corresponding verbs are:

  VERB:
    ADDINTERACTION INTERACTIONID SCHEMAURI
    SUBSCRIBEINTERACTION INTERACTIONID NVALUEIDS (VALUEID)+
    UNSUBSCRIBEINTERACTION INTERACTIONID
    INTERACT INTERACTIONID NVALUES (VALUE)+

An interaction schema looks like an entity schema, but with the target "interaction" instead of "entity".

It would be useful if, instead of "ADDINTERACTION" and "ADDENTITYTYPE," there could be an "ADDENTITYSCHEMA" and an "ADDINTERACTIONSCHEMA" which defined the set of entities and interactions that could happen in the given host. The other end could then compare that schema to something it already has, and wouldn't have to transfer all the capabilities each time it connected. However, keeping that out of 1.0 makes it simpler to get something done sooner.