AI brings safety to public transportation

Björn Runåker
16 min read · Jul 24, 2022

Introduction

This document describes the reasons behind the 5G Ride edge platform and how we designed the AI software to deliver safety to public transportation as a service platform. 5G Ride is a concept for self-driving, 5G-connected electric vehicles used in public transport. In the autumn of 2020, the vehicles were tested in traffic at Royal Djurgården in Sweden. In 2021, the project continued, focusing on the development of the Connected Control Tower. This document focuses on the implementation, which fully utilizes several unique Intel architectural components: it would not have been possible without the DL Boost instruction set (VNNI), the secure enclave (SGX), and large memory with low-latency storage (PMEM).

Overview

As autonomous buses become commonplace, the lack of a human driver will lessen passengers' sense of security. The driver's role is to navigate to the next bus stop according to a schedule, answer questions from passengers, and handle emergencies on and around the bus. Without a human driver, who will passengers talk to, and who will manage emergencies? Being alone without a visible driver is a new situation for passengers; it can intimidate them and give them a sense of being exposed to the unknown.

This POC (proof of concept) addresses this issue by providing solutions that improve passenger safety. One implemented use case is reminding passengers who are about to leave the bus that they forgot their luggage. The AI alerts the control tower if the passenger ignores the reminders and leaves the luggage on the bus.

In addition, we implemented detection of passengers who may be unwell. If a passenger falls over, the AI speaks to the passenger and asks whether they need medical assistance. The control tower is alerted if the passenger does not respond or asks for help.

The use case utilizes video analytics and speech recognition. There is no need for a continuous video stream to a remotely located human; the video and audio feeds are only started when a situation requires an alert to the operator.

Video analytics is also capable of detecting violence in real time and responding instantaneously. The outside camera detects hazards around the bus and reports accidents or roadblocks to the control tower in real time. It also provides updated status for planning maintenance work, such as filling potholes, cleaning up litter, and removing illegally parked cars, and it helps update bus schedules by counting the number of people waiting for the bus on the sidewalk. Each type of information collected from the buses is routed to the right system.

Implementation challenges compared to the first 5G Ride

The first 5G Ride was known to the public through the participation of the Swedish royal family: Prince Daniel acted as a passenger on the autonomous bus. Our focus in that project was on detecting objects in front of the bus: humans, cars, bicycles, and potholes. The solution was implemented as a pure video analytics pipeline. The main software component is OpenVINO connected in a GStreamer pipeline, utilizing code from the DL Streamer repository together with several pre-trained detection and recognition models. These models ran inference in turn on each frame and built a list of regions of interest. When all inferences were done for one image frame, the regions of interest were collected and watermarked by rendering them according to the classification of each identified object. Real-time anonymization was done by custom rendering over faces and license plates, required by the GDPR rules regarding live streaming of video from a vehicle.
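As a rough illustration of this kind of pipeline (not the actual 5G Ride code), here is a minimal Python sketch assuming GStreamer with the DL Streamer elements and the OpenVINO runtime are installed; the camera URL and model paths are placeholders:

```python
# Minimal sketch of a DL Streamer pipeline: detection -> classification -> watermark
# rendering, all on the CPU via OpenVINO. The camera URL and model files are
# placeholders, not the actual 5G Ride configuration. The custom anonymization
# rendering over faces and license plates is not shown here.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

pipeline_description = (
    "rtspsrc location=rtsp://camera/stream ! decodebin ! videoconvert ! "
    "gvadetect model=person-vehicle-bike-detection.xml device=CPU ! "
    "gvaclassify model=vehicle-attributes-recognition.xml device=CPU ! "
    "gvawatermark ! "  # renders boxes and labels for each region of interest
    "videoconvert ! autovideosink"
)

pipeline = Gst.parse_launch(pipeline_description)
pipeline.set_state(Gst.State.PLAYING)

# Run until the stream ends or an error occurs.
bus = pipeline.get_bus()
bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE, Gst.MessageType.EOS | Gst.MessageType.ERROR)
pipeline.set_state(Gst.State.NULL)
```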

Several peers at different companies could not understand how we succeeded in making all these inferences on the CPU. One moment in particular made me realize what an impact VNNI had on inference performance in our solution: the lead AI architect from one of our AI partner companies said this was the first time she had seen this level of inference done without a GPU, and she was impressed by the low latency (and non-existent time jitter), which would be significant when using a GPU. We influenced a range of software companies in the AI domain to consider Intel CPUs with VNNI when they need low-latency, low-jitter inference.

5G Ride part II

The Control Tower
The portal screen of the Control Tower

However, 5G Ride part II is a much more complex implementation, where the most critical inference part utilizes a custom model. This meant we could not use a pre-trained model from the Open Model Zoo, not even a combination of different models as the previous 5G Ride project did. We built the model from scratch, which meant collecting and annotating the images ourselves, which was much more time-consuming than anticipated. A detailed explanation of setting up a model training process (MLOps) to constantly improve the inference done by the already deployed solution will be given in a separate document.

Detecting a passenger and a bag

Still, the inference and the models used are just a tiny part of the software stack. The reason is that this demo's use case is complex from an AI analytics point of view. For one thing, inference on a single image frame is not enough: detecting lost luggage cannot be done by looking at just one frame. The system needs to evaluate several frames to detect a person with bags and then notice when this person starts to leave the baggage behind. The implementation therefore needs a notion of memory, so that rules can involve previous state. The aim is to have a system that connects a particular piece of luggage with a specific person by using re-identification models. Assuming we can send push notifications to passengers (e.g., a mobile app that contains the bus ticket, gives updated schedule information and, in this case, push notifications about personal items left on the bus), we can send a picture of the right luggage to the person about to leave the bus. This last part is designed but was not implemented during this specific POC.
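To illustrate the idea of memory across frames, here is a simplified sketch (the per-frame detection format and the class are hypothetical, not the POC implementation) that associates each bag with the nearest person and flags bags whose owner has left the frame:

```python
# Simplified sketch of a stateful "forgotten luggage" rule.
# The detections-per-frame format is a hypothetical placeholder.
from dataclasses import dataclass, field

@dataclass
class LuggageTracker:
    # person_id -> set of bag_ids last seen close to that person
    owned_bags: dict = field(default_factory=dict)

    def update(self, frame_detections):
        """frame_detections: list of dicts like
        {"type": "person"|"bag", "id": str, "x": float, "y": float}"""
        people = [d for d in frame_detections if d["type"] == "person"]
        bags = [d for d in frame_detections if d["type"] == "bag"]

        # Associate each bag with the nearest person in this frame.
        for bag in bags:
            if not people:
                continue
            nearest = min(
                people,
                key=lambda p: (p["x"] - bag["x"]) ** 2 + (p["y"] - bag["y"]) ** 2,
            )
            self.owned_bags.setdefault(nearest["id"], set()).add(bag["id"])

        # A person seen earlier who is no longer in the frame, while their bag
        # still is, triggers a "luggage left behind" alert.
        visible_people = {p["id"] for p in people}
        visible_bags = {b["id"] for b in bags}
        alerts = []
        for person_id, bag_ids in self.owned_bags.items():
            if person_id not in visible_people:
                left_behind = bag_ids & visible_bags
                if left_behind:
                    alerts.append((person_id, left_behind))
        return alerts
```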

Another part of the use case that significantly grows the software stack is all the interfaces that need to be connected. One is the set of external portals, where we use the MQTT protocol to send information about the number of passengers and different alerts. One portal is the Control Tower, where an operator can decide to communicate with the bus or even take control of it. There are also analyzed video streams that need to be secured to be at least GDPR compliant. The security solution chosen by the project was too complex to implement in this POC, so we implemented our own scalable video server in our data center to show how to fully utilize SGX end to end and get role-based access to the system. This meant implementing a generic portal with secure WebRTC and every event with its associated images. WebRTC is by design very insecure, so we needed deep integration with the WireGuard VPN protocol to ensure the required security. Our generic portal is, however, designed to integrate effortlessly into customer portals.

When we understood that the demo of the Control Tower portal would take place in a different location, far from the bus, which meant the passengers would not see what the AI did on the bus, we decided to give the AI a voice. This also involved adding voice recognition. It meant we could first alert the passenger when they forgot a piece of luggage and alert the Control Tower only when they did not pick up their bags. This saves bandwidth to the Control Tower and automatically handles several cases on the bus without overloading the operators at the Control Tower. A similar process was used when people fell over or lay on the seats: the system asks the passenger if anything is wrong to assess whether it is a medical situation, and if the problem is not resolved, the Control Tower is alerted.

The OpenVINO toolchain showed its strength by making it possible to implement a fully voice-controlled assistant within a minimal project timeframe. There were still some challenges, such as finding and tuning models, correcting noise cancellation to improve recognition, and knowing whether a person was talking to the bus at all. The last problem was solved by introducing the persona "Iris" and opportunistic speech inference that matches sentences in every audio segment heard. One of the models could understand over 1,000 words when all conditions were perfect. Side note: many voice assistants use a wake word when the user wants to address the assistant. This is because the actual natural language processing is done in the cloud, and there is a need to limit this communication. Our solution does all the processing locally, with total privacy, without transmitting conversations outside the bus and without involving a third party listening in on a potentially private exchange.
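As a simplified illustration of this opportunistic matching (the command patterns and actions are made up for the example, not the actual Iris grammar), every recognized sentence can be checked locally and acted on only when it addresses the persona:

```python
# Sketch of opportunistic command matching on continuously recognized speech.
# The command set and action names are illustrative, not the actual Iris rules.
import re

COMMANDS = {
    r"\biris\b.*\b(help|emergency)\b": "alert_control_tower",
    r"\biris\b.*\bnext stop\b": "announce_next_stop",
    r"\biris\b.*\b(yes|i am ok)\b": "cancel_medical_alert",
}

def handle_utterance(text):
    """Run on every recognized sentence; act only when it addresses the persona."""
    lowered = text.lower()
    for pattern, action in COMMANDS.items():
        if re.search(pattern, lowered):
            return action
    return None  # not directed at Iris: ignore, nothing leaves the bus

print(handle_utterance("Iris, I need help"))  # -> "alert_control_tower"
```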

But on the demo day with the journalists, after a sound check of the noise level, we decided to deactivate this function to mitigate the risk of one of the journalists getting into an argument with Iris. (This was done using the generic portal's override function.) The voice function was still active, so the bus could tell the passenger directly when a bag was left behind. The text-to-speech generation time was less than 500 ms for a complex sentence, and speech-to-text was even quicker, though it sometimes needed to reinterpret ambiguous words once the human had completed the sentence.

Implementation

The implementation consists of the Edge Communication server, Edge AI server, and DC Portal server.

Edge Communication Server

The communication server handles all the ways of communicating with the data center. Its responsibility is to use the most cost-efficient path securely. It utilizes Ethernet or WLAN when available and otherwise uses 5G SA, ordinary 5G, or 4G LTE over multiple modems. It is responsible for recording the baseband telemetry from each modem to improve quality-of-service prediction by correlating with position and time. The communication server can optimize how other services communicate. We have prepared for a future POC that includes a WLAN hotspot or a mobile 5G base station. Figure 1 shows how the communication server optimizes both AI server connectivity and the network experience for the passengers.

Figure 1: Communication bandwidth at different locations.

How the Communication Server evolved in 5G Ride II

The secure communication has evolved from SSH-based tunneling in the first 5G Ride to a mesh network based on WireGuard in 5G Ride II. The problem with SSH is performance, and WireGuard improves it a lot. Security is also better because each communication path has its own key pair, which an SGX-protected agent can generate to ensure node identity. The mesh network over multiple modems improves network performance and resilience.

The communication server also controls what kind of data can be transferred depending on connectivity. Bulk transfers are only done when a high-speed, low-cost connection is available. When communication runs over a low-bandwidth wireless channel, only telemetry events with a small payload are transmitted. The encoding of the image and video feeds is controlled by how much bandwidth is expected to be available, as sketched below.
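A minimal sketch of such a policy, with illustrative link names, costs, and thresholds rather than the real configuration:

```python
# Minimal sketch of the connectivity policy: pick the cheapest available link and
# decide which data classes may be sent over it. Names, costs, and thresholds are
# illustrative assumptions, not the actual 5G Ride configuration.
LINKS = [
    # (name, relative cost, typical bandwidth in Mbit/s)
    ("ethernet", 0.0, 1000),
    ("wlan",     0.1,  300),
    ("5g_sa",    1.0,  200),
    ("5g_nsa",   1.5,  100),
    ("4g_lte",   2.0,   30),
]

def select_link(available):
    """Return the cheapest link that is currently up, or None when offline."""
    candidates = [link for link in LINKS if link[0] in available]
    return min(candidates, key=lambda link: link[1]) if candidates else None

def allowed_traffic(link):
    """Decide what may be transmitted on the selected link."""
    if link is None:
        return {"telemetry_buffered"}
    name, cost, bandwidth = link
    allowed = {"telemetry"}                 # small events are always allowed
    if bandwidth >= 100:
        allowed.add("video_low_bitrate")    # re-encoded feed for the Control Tower
    if cost <= 0.1 and bandwidth >= 300:
        allowed.add("bulk_upload")          # training data, logs, recorded video
    return allowed

link = select_link({"5g_sa", "4g_lte"})
print(link[0], allowed_traffic(link))       # -> 5g_sa, telemetry + low-bitrate video
```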

Note that the WebRTC server is implemented so that only clients connected to the secure mesh can get a video feed (STUN and TURN servers are explicitly prohibited, so clients must be in the same logical network segment as the WebRTC server, which they are when inside the WireGuard mesh network). This is the current state of the art in end-to-end network security, and it also made it possible to differentiate the video feeds from the data center based on the user's access role (down to different levels of anonymization). And because it is a point-to-point mesh network integrated into each server node in the DC, we can scale the servers horizontally without external VPN servers, which would be a potential single-point bottleneck. (See Figure 2, where each Portal Server is directly connected to clients because they must be in the secure domain to be able to communicate.)

Also note that the DC has no open ports except for WireGuard, which reduces the attack surface to a minimum.

Edge AI Server

The current implementation of the AI server platform has three main streams on which it executes inference: two video feeds, from the dashcam and the ceiling cam, and an audio stream for speech recognition. Internally and externally, there are several communication paths for presenting the results of the inference. The challenge is to ensure low latency in all communication paths and the correct order of events, because the response depends on the order of the events. The implementation is container-based and uses different implementation languages depending on whether a component is storage-, CPU-, or communication-bound. When storage is the limiting factor, we use PMEM to store and retrieve large amounts of data with low latency. The video analytics inference is implemented in C++ because a large part of the logic is custom code that needs to be compiled to optimize performance for the amount of data to process. Audio analytics has less data bandwidth and can finish all processing well before the real-time deadline, even when implemented in Python. Figure 2 shows the edge node on the left and the data center at the top right. At the bottom right is a test setup we used to test both the current models for video analytics and the security of different WireGuard-based secure mesh networks. We validated the network security solution by setting up WireGuard and making sure that getting a live video feed was effortless and that the process was easy to describe.

Event Scheduler

The heart of the AI server in this POC is the Event Scheduler. In Figure 2, the content of the Event Scheduler is expanded in the bottom-left picture. The backbone is based on Apache Kafka, which collects all events generated by the different inferences and by external commands and then distributes the processing of this data. Apache Kafka is tuned for low latency in the edge node but adjusted for redundancy in the data center. The central processor of the data is the state machine, custom code written in Python. All use cases are coded as different states that cause the state machine to generate an appropriate response. Adding use cases is a matter of manually adding rules to the state machine. Side note: we are currently investigating a transformer-based model, of the kind usually used in NLP (natural language processing), for complex state-based decision-making.
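The pattern looks roughly like the following sketch, assuming the confluent-kafka Python client; the topic name, states, and actions are illustrative, not the actual rule set:

```python
# Sketch of the event-scheduler pattern: consume inference events from Kafka and
# feed them to a small state machine. Topic, states, and actions are illustrative.
from confluent_kafka import Consumer

class LuggageStateMachine:
    def __init__(self):
        self.state = "IDLE"

    def on_event(self, event_type):
        if self.state == "IDLE" and event_type == "bag_left_behind":
            self.state = "REMIND_PASSENGER"
            return "speak:please_take_your_luggage"
        if self.state == "REMIND_PASSENGER" and event_type == "bag_still_on_bus":
            self.state = "ALERT_SENT"
            return "mqtt:alert_control_tower"
        if event_type == "bag_picked_up":
            self.state = "IDLE"
        return None

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "event-scheduler",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["inference-events"])

sm = LuggageStateMachine()
try:
    while True:
        msg = consumer.poll(0.1)
        if msg is None or msg.error():
            continue
        action = sm.on_event(msg.value().decode("utf-8"))
        if action:
            print("dispatch:", action)  # e.g. publish to a response topic
finally:
    consumer.close()
```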

REST API

Because the AI server is designed to be a platform as a service, we also handle REST, MQTT, and file-based APIs, which were also used internally in this POC. The REST server is based on FastAPI, one of the fastest Python API frameworks at the moment. Many software developers prefer Python because it shortens the development time of new features. The REST server is backed by Celery, a distributed asynchronous task queue that handles slow tasks in the background. For example, functions that communicate with the external MQTT server used by the Control Tower take a long time and would otherwise impact the performance of other tasks in the edge node.
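A minimal sketch of this REST-plus-task-queue pattern (the broker URL, endpoint, and task are placeholders, not the production configuration):

```python
# Sketch of a FastAPI endpoint that offloads the slow external MQTT publish to a
# Celery worker. Broker URL, endpoint path, and task body are placeholders.
from celery import Celery
from fastapi import FastAPI

app = FastAPI()
celery_app = Celery("edge_tasks", broker="redis://localhost:6379/0")

@celery_app.task
def publish_alert_to_control_tower(payload: dict):
    # The slow network call to the external MQTT broker would happen here,
    # in the background worker, not in the request handler.
    ...

@app.post("/alerts")
def create_alert(alert: dict):
    # Return immediately; the worker handles the slow external communication.
    publish_alert_to_control_tower.delay(alert)
    return {"status": "queued"}
```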

MQTT API

The edge node and the DC implement a much more performant MQTT server based on WebSocket that leverages the speed and security of WireGuard, which has no known security exploits.
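For example, a client inside the mesh could publish over MQTT-over-WebSocket roughly like this, assuming the paho-mqtt client (broker address, port, and topic are placeholders):

```python
# Sketch of publishing an event over MQTT-over-WebSocket inside the WireGuard mesh.
# Broker address, port, and topic are placeholders for the actual deployment.
import json
import paho.mqtt.client as mqtt

client = mqtt.Client(transport="websockets")
client.connect("10.0.0.1", 9001)   # WireGuard mesh address of the broker
client.loop_start()                # background network loop

info = client.publish("bus/42/passenger_count", json.dumps({"count": 7}), qos=1)
info.wait_for_publish()            # block until the broker acknowledges the message

client.loop_stop()
client.disconnect()
```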

Here we implemented messages that the partners did not manage to implement during the project. Through our generic portal we have access to all our MQTT messages, both those the partners implemented and those they did not. Side note: the partners did not focus on security in this POC, but they are interested in utilizing SGX to improve security. This platform aims to make it easy for them and other customers to use SGX in combination with state-of-the-art network security based on WireGuard.

FILE API

The text generated by the speech recognizer and sent to the speech generator used files. The reason was to test the file-based API and cut development time by not including REST, Kafka, or MQTT clients. Using Linux's low-level inotify mechanism to monitor the file events resulted in lower latency for transferring the data from one container to the next than going via a network-based API! By creating new files with an increasing index, we had a solution comparable to Kafka's history log of events, which made debugging much more straightforward. Side note: I did not test gRPC in this POC, but I have seen in later POCs that it would be a good candidate for low-latency communication.
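A sketch of this file-based hand-off, using the watchdog library (which relies on inotify on Linux); the directory and file naming are placeholders, not the actual container layout:

```python
# Sketch of the file-based hand-off: watch a shared directory for new text files
# from the speech recognizer and feed them to the speech generator. The path and
# file extension are placeholders; watchdog uses inotify on Linux under the hood.
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class NewTextHandler(FileSystemEventHandler):
    def on_created(self, event):
        if event.is_directory or not event.src_path.endswith(".txt"):
            return
        with open(event.src_path) as f:
            text = f.read()
        print("send to speech generator:", text)

observer = Observer()
observer.schedule(NewTextHandler(), path="/shared/speech_out", recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```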

Kafka API

We include a Kafka server with support for the Kafka Producer, Consumer, Connector, Streams, and Admin APIs, which have SDKs for all major languages. This makes it easy to develop custom containers with access to all events and data streams.
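For example, a custom container could publish its own events onto the shared bus with just a few lines, assuming the confluent-kafka client (broker address and topic are placeholders):

```python
# Sketch of a custom container publishing its own events onto the shared Kafka bus.
# Broker address and topic name are placeholders.
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("inference-events", value=b"bag_left_behind")
producer.flush()  # block until the event is delivered
```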

In summary, the Event Scheduler is a platform for quickly adding new services because it supports all the standard API types. The REST and MQTT servers are easy to extend with new topics. The REST server also supports rendering web pages, so a testing framework is available during development. Hot reload of changes is supported in all APIs, saving time by not needing to rebuild everything while creating new use cases.

Figure 2: Block diagram of the complete 5G Ride II.

Video streaming

All internal video streaming is based on RTSP. It is a low-latency, tried-and-true video technology, and we added custom code to re-encode the video for different bitrates on the fly. It is easy to fan out to multiple endpoints for processing or to scale to other RTSP servers for routing. We added a WebRTC server to allow humans to see the video stream in a normal browser without extra plug-ins.
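A sketch of the on-the-fly re-encoding, again as a GStreamer pipeline in Python; the stream URL, bitrate, and output sink are illustrative placeholders:

```python
# Sketch of re-encoding an internal RTSP feed to a lower bitrate on the fly.
# Stream URL, bitrate, and output sink are placeholders, not the actual setup.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

pipeline = Gst.parse_launch(
    "rtspsrc location=rtsp://edge-node/ceiling-cam latency=0 ! "
    "rtph264depay ! avdec_h264 ! videoconvert ! "
    "x264enc bitrate=800 tune=zerolatency speed-preset=ultrafast ! "  # ~800 kbit/s for low-bandwidth links
    "rtph264pay ! udpsink host=10.0.0.2 port=5000"
)
pipeline.set_state(Gst.State.PLAYING)

# Run until the stream ends or an error occurs.
pipeline.get_bus().timed_pop_filtered(
    Gst.CLOCK_TIME_NONE, Gst.MessageType.EOS | Gst.MessageType.ERROR
)
pipeline.set_state(Gst.State.NULL)
```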

Secure video transmission

A feature of WebRTC is that it is excellent at getting around firewalls by using STUN and TURN servers. In our case, however, we want to contain the WebRTC stream within the secure domain assigned to the client. The WebRTC server is custom-configured for this: it only allows access between the video server and the client via the WireGuard interface and prohibits all the protocols that could cause data leaks on the client side. We aim to have security by default. In the current implementation, the client must install WireGuard manually. Even if this is simple enough for a non-technical person, we aim to automate it and make it even more secure. We will implement an agent that runs in a secure enclave on the client's computer. It constantly ensures correct setup, verifies the integrity of the client machine, and updates the WireGuard configuration as needed. SGX's remote attestation will be used to ensure integrity on both the edge node and the client machine (see Figure 3).

Figure 3: Remote attestation with SGX.

Security

SGX will secure models, stored data, and communication on the edge node (see Figure 4).

On the client side, SGX protects the VPN endpoint (WireGuard).

We use SGX to ensure each node is identified at the hardware level, to run privacy-sensitive code in encrypted memory, and to store encryption keys securely. Because we use WireGuard, with its proven security, any attempt to compromise the system must target the endpoints, most probably the edge node or the client machines. Such an attack will be much more difficult because SGX is combined with Secure Boot, so the boot process is also protected. We are constantly evaluating weak links in the chain where someone could compromise security; it is an ongoing effort. Let me know what can be improved!

Figure 4: SGX secure enclaves.

AI in far-edge-to-cloud

As seen in this project and earlier projects, we need to constantly train the models after deployment, using actual data that is collected continuously. Without post-deployment collection of training data, the model will soon become "stale."

Model staleness over time vs refreshing models over time

Each update of the models with new data from sensors used in production improves accuracy and adjusts for changes in the data to classify. Following this way of working means adding to and updating the training material used by the deep learning process in the data center, and updating the models in the edge node as soon as there is an improved model to roll out. In the next project, we will also have several far-edge devices, which increases the complexity of a roll-out. All this must be fully automatic, because there are too many steps where human error can cause the system to fail. We use the following process in this POC and will use it even more in the next (see Figure 5). The training will be an ongoing process, probably active 24/7 in the data center, processing new data as it becomes available. The sources will be humans who correct misidentifications from the models. Examples of these images will be correctly annotated and sent to the data center as inference feedback. The training will then be done with data sets that include these new images and tested so that the failed classifications are corrected. Finally, the updated model is released and ready for roll-out. Side note: in this POC we used YOLOv4, but in an upcoming POC we switched to ATSS, which seems much better when retraining with updated images.

We will also investigate different ways to utilize unsupervised training, because manual annotation is very time-consuming. Side note: in the upcoming POC we use a system where inference chooses the images that training would benefit from the most if a human helped classify them, so-called "human-in-the-loop." This saves annotation time by picking only a subset of all images for the human to work on.

This process is part of the MLOps process utilized in this POC and is as necessary as the DevOps process that is common today.

Figure 5: Data center to edge to far edge.

In summary, a constant connection between the cloud and the edge (and far edge) is essential to maintain and improve edge-based inference quality, optimized for the specific AI acceleration used on each device.
