Optical tracking is the next frontier of VTubing

In August 2022 a new streamer debuted on Twitch: Kellogg’s Tony the Tiger. Tony played the game Fall Guys while interacting with prominent streamers in real time as part of a collaboration between Kellogg’s and Twitch’s Brand Partnership team. No costume or make-up was involved – rather, the stream was made possible thanks to a live motion capture performance that was used to animate a digital Tony avatar.

This wasn’t the first time a digital avatar had streamed in real time, but it was a clear sign that ‘VTubing’ had reached the mainstream.

The trend has been growing for a while. Amazon says that VTubing content on Twitch grew 467 percent year on year last year, and in 2020 some 38 percent of YouTube’s 300 most profitable channels were run by VTubers.

On the technological side, VTubers are still finding their feet. At the low end, content creators can spend a few hundred dollars on a rigged 2D avatar and a good-quality webcam to stream a crudely animated character. In Japan, however, the most popular VTubers are already being produced by studios such as Polygon Pictures using motion capture hardware, and a number of English-language VTubers are now turning to more powerful motion analysis solutions too.

Matt Workman is at the forefront of this trend, demonstrating how, with minimal previous experience, an individual streamer or DIY indie studio can produce real-time 3D content using a Vicon setup.

Workman’s background is in developing 3D characters and environments in Unreal Engine. His app, Cine Tracer, is a real-time cinematography simulation used by filmmakers for storyboarding, and he’s currently creating clothing for Epic’s MetaHumans – the most advanced digital humans in the public sphere. That work wasn’t the main driver for his interest in optical capture, though.

“The original interest with the Vicon system was VTubers,” says Workman. “The biggest VTubers now are mostly 2D. They’re live, they’re producing content 24/7, they’re getting the highest engagement on YouTube and some of the highest on TikTok. They’re massive. Some of them want to move into 3D, but it has to all be live. But there’s this whole learning process and naturally they start with what is easily accessible.”

Often that means using inertial capture or the tracking hardware that comes with commercial virtual reality systems. “I was talking to a bunch of VTubers, and some of the higher-end ones were using inertial suits and whatnot, but the feeling was that for live streaming, inertial is a little unreliable.” Between complications with battery life, connectivity problems and accuracy issues, many creators are in need of a more robust and reliable solution.

“So my thought was, what if you gave a high quality optical system to an indie VTuber?” says Workman. “How about matching high fidelity motion with one of Epic Games’ high fidelity MetaHumans?”


Workman acquired 10 Vero cameras and set them up at home. “We weren’t even sure if it was going to work,” he says. “I have a drop ceiling and weird columns all over the place. We thought, ‘we’ll just see’.

“But on the first day I got results that looked good. It pretty much worked out of the gate. It’s very robust. I set it up once and haven’t adjusted it since, and it still gets the results I want, so I’m pretty happy with it.”

Workman began publishing videos documenting his setup and learning process, using Shōgun and Unreal Engine to link his performances with Epic’s MetaHumans in real time. The results have attracted hundreds of thousands of views and sparked significant interest in the VTubing community.

“I tell a lot of these people who are getting into high-end live VTubing, if you’re using inertial suits, they have batteries in them that die,” Workman says. “And you’re relying on Wi-Fi, which is horrible. It’s so unreliable in a live context. Whereas, it’s so lo-fi to just put optical markers on your suit and start shooting with an optical system. You can jump in and it just works and you could run it for 12 hours. It’s built for Nike and NASA to record crazy things, so by the time we get to entertainment it’s just so robust.”


“Since we’ve proved that you could set this up at your house, more and more people have been looking at optical for high-end live, and for solo or small operations,” says Workman. “It’s cool to see mocap virtual avatars really starting to happen, because I think a lot of people tried and failed in the past.

“They think, ‘Let’s buy four [inertial] suits and have someone make us an avatar and we’ll just make a music video.’ And it turns out, if you’ve never done it and you don’t have the right equipment, it’s extremely difficult!

“With the Vicon systems though, you’re not even necessarily going to need to clean up the mocap. If you’re carefully filming the angles, you don’t have to do anything.

“I think that since I put up the demos, people have realized that optical’s the best thing for live.”

That realization is starting to turn into action. “I’ve been talking to a couple of VTubers, and they have the budgets. It’s just the learning curve that’s holding them back, because if they’re going to be 3D, they’re going to use optical.

“I think that eventually, if not very soon, all the major 3D VTubers are going to be running a Vicon system or some sort of stable optical, because it’s live and they’re just producing live 24/7. And I think that’s going to be a pretty big category of views. The number of creators that are doing it is still small, even the 2D ones, but I do see that evolution.”


Workman envisions a bigger role for motion capture in social content than just avatars talking to camera. One possibility is for the creation of licensed performances from musicians.

“Let’s get these really nice 3D performances that everyone can use. We’re starting to see that with a character dancing to pre-licensed music in an animation that you can put into your TikTok video – those are massive. Those make a lot of money for everyone involved,” he says.

“But how about getting the 3D avatar of Snoop Dogg performing his latest song, and you make your own music video using it? The best one goes the most viral, and it’s all licensed and monetized so that there’s no piracy. Someone can build that ecosystem, because those videos are doing massive numbers.

“Fortnite and Roblox and other platforms like that, they’re already kind of doing it. The next step is trying to make it user-generated content, not material produced exclusively by Epic Games or Sony.

“There are a couple of people kind of building that space, and that’s an ecosystem I’d like to be part of.”