Failed project: An ESP32-S3 based KVM solution


The setup

As I’m planning to build a proper NAS server for myself, I’ve begun looking into a lights-out management solution for it (or KVM, if you prefer). The gist is that I want to be able to access the server’s display and keyboard/mouse without having to physically be in front of it.

Currently there are a few solutions for that:

I really liked the idea of the PiKVM solution, but as Raspberry Pis are still neither cheap nor readily available, I had the idea of replacing the Raspberry Pi with an ESP32-S3 microcontroller. The ESP32-S3 seemed like a good fit for the task because it has USB host support, so an HDMI capture dongle could be connected to it.

TL;DR: I’ve failed to get the MS2109-based HDMI capture dongle to work with the ESP32-S3, because its USB descriptors are broken when it runs over USB 1.1.

Initial experiments with video capture


Before I even started thinking about connecting the HDMI capture dongle to the S3, I took a look at these two projects:

Both of these examples take the MJPEG stream from a USB webcam and stream it via HTTP to a web browser. The difference between them is that the first one creates a hotspot and the second one connects to an existing WiFi network. In addition to that, they both appear to share code.

The interface of the usb_camera_mic_spk example, feat. my face

I’ve tried both of them with a Microsoft Life-Cam 3000 webcam, and I was able to get about 10 fps at a resolution of 640x480. I figured this should be enough to read a terminal while using the KVM, so I proceeded to the next step.

Dongle trouble


After the success with streaming the video, it was time to connect the HDMI capture dongle. Despite the blue color of the USB connector on the dongle, it is in fact a USB 2.0 device, so I hoped it would be compatible with the USB 1.1 interface of the S3.

To my unfortunate surprise, after I connected it to the ESP32-S3 the microcontroller just hung, causing the watchdog to trigger.

This is the log output I got from idf.py monitor:

0x4201f156: task_wdt_timeout_handling at /opt/esp-idf/components/esp_system/task_wdt/task_wdt.c:461 (discriminator 3)

0x4201f302: task_wdt_isr at /opt/esp-idf/components/esp_system/task_wdt/task_wdt.c:585

0x403777dd: _xt_lowint1 at /opt/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/xtensa_vectors.S:1118

0x4201710f: usb_parse_next_descriptor at /opt/esp-idf/components/usb/usb_helpers.c:22 (discriminator 1)

0x4201717a: usb_parse_next_descriptor_of_type at /opt/esp-idf/components/usb/usb_helpers.c:45

0x4201727b: usb_parse_interface_number_of_alternate at /opt/esp-idf/components/usb/usb_helpers.c:65

0x42010719: parse_configuration at /home/alufers/Installs/esp-idf-video-streaming/components/usb_host_uvc/src/descriptor.c:177 (discriminator 2)

0x42010852: raw_desc_to_libusb_config at /home/alufers/Installs/esp-idf-video-streaming/components/usb_host_uvc/src/descriptor.c:225

0x4201042c: libusb_get_config_descriptor at /home/alufers/Installs/esp-idf-video-streaming/components/usb_host_uvc/src/libusb_adapter.c:699 (discriminator 2)

0x4200d7cc: uvc_get_device_list at /home/alufers/Installs/esp-idf-video-streaming/components/usb_host_uvc/libuvc/src/device.c:721

0x4200d926: uvc_find_device at /home/alufers/Installs/esp-idf-video-streaming/components/usb_host_uvc/libuvc/src/device.c:140

0x4200bd2b: app_main at /home/alufers/Installs/esp-idf-video-streaming/main/main.c:384 (discriminator 2)

So it seems like ESP-IDF was hanging while reading the USB descriptors of the dongle. I opened the location of the hanging code and noticed that there are logging statements in the usb_parse_next_descriptor function. So I added esp_log_level_set("*", ESP_LOG_VERBOSE); to the app_main function and recompiled the project.

This quickly led me to this line of the usb_stream.c file. The parsing of the configuration descriptor was getting stuck at offset 153 of 355, seemingly because the length of the descriptor at that offset was 0.
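A zero length is enough to explain a hang like this one: USB descriptors are packed back to back, and a parser finds the next one by adding the current descriptor's bLength to its offset. If bLength is 0, the offset never advances. Below is a minimal Python sketch of such a walk with a guard against that case. It is not the ESP-IDF code; the function name and the sample bytes are made up for illustration.

```python
def walk_descriptors(buf: bytes):
    """Yield (offset, bLength, bDescriptorType) for each descriptor in buf.

    Raises ValueError instead of looping forever when a descriptor reports
    a bLength below the 2-byte minimum or one that runs past the buffer.
    """
    offset = 0
    while offset < len(buf):
        if offset + 2 > len(buf):
            raise ValueError(f"truncated descriptor header at offset {offset}")
        b_length = buf[offset]        # byte 0 of every descriptor
        b_type = buf[offset + 1]      # byte 1 of every descriptor
        if b_length < 2:
            # Without this check, offset += 0 would spin forever.
            raise ValueError(f"bLength={b_length} at offset {offset}")
        if offset + b_length > len(buf):
            raise ValueError(f"descriptor at offset {offset} overruns the buffer")
        yield offset, b_length, b_type
        offset += b_length


# Hypothetical sample data: a 9-byte configuration descriptor (type 0x02)
# followed by a 9-byte interface descriptor (type 0x04).
CONFIG = bytes([9, 0x02, 18, 0, 1, 1, 0, 0x80, 50])
INTERFACE = bytes([9, 0x04, 0, 0, 0, 0x0E, 0x01, 0, 0])

if __name__ == "__main__":
    for entry in walk_descriptors(CONFIG + INTERFACE):
        print(entry)
    try:
        # Two trailing zero bytes imitate a descriptor with bLength == 0.
        list(walk_descriptors(CONFIG + INTERFACE + bytes(2)))
    except ValueError as err:
        print("stopped:", err)
```

With the two trailing zero bytes appended, the walk stops with a ValueError at offset 18 instead of spinning until a watchdog fires.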

I decided to plug the dongle into my Linux PC with Wireshark running and monitoring usbmon0. To my surprise, the configuration descriptor reported to the PC was 1211 bytes long, instead of the 355 bytes which the ESP32-S3 got.
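The total length a host sees comes from the wTotalLength field, a little-endian 16-bit value at bytes 2 and 3 of the 9-byte configuration descriptor header. As a rough sketch, this is how that header could be decoded from raw capture bytes in Python. Only the two wTotalLength values come from my captures; the function name and the other field values in the example are placeholders.

```python
import struct
from typing import NamedTuple


class ConfigHeader(NamedTuple):
    bLength: int
    bDescriptorType: int
    wTotalLength: int
    bNumInterfaces: int
    bConfigurationValue: int
    iConfiguration: int
    bmAttributes: int
    bMaxPower: int


HEADER_FORMAT = "<BBHBBBBB"  # 9 bytes, little-endian, as laid out in USB 2.0 chapter 9


def parse_config_header(data: bytes) -> ConfigHeader:
    """Decode the 9-byte header that starts every configuration descriptor."""
    if len(data) < struct.calcsize(HEADER_FORMAT):
        raise ValueError("need at least 9 bytes for a configuration descriptor header")
    header = ConfigHeader(*struct.unpack_from(HEADER_FORMAT, data))
    if header.bDescriptorType != 0x02:
        raise ValueError(f"not a configuration descriptor (type {header.bDescriptorType:#04x})")
    return header


if __name__ == "__main__":
    # Placeholder headers: only wTotalLength (355 and 1211) matches what I observed.
    full_speed = struct.pack(HEADER_FORMAT, 9, 0x02, 355, 1, 1, 0, 0x80, 250)
    high_speed = struct.pack(HEADER_FORMAT, 9, 0x02, 1211, 1, 1, 0, 0x80, 250)
    print(parse_config_header(full_speed).wTotalLength)
    print(parse_config_header(high_speed).wTotalLength)
```

A host normally reads these 9 bytes first and then requests wTotalLength bytes, so a difference in this field means the device itself is announcing a different descriptor set.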

I’ve also checked this with the Microsoft Life-Cam 3000 webcam, and the configuration descriptor there also differed between the PC and the ESP32-S3.

The configuration descriptor length of the Life-Cam 3000 as reported by Wireshark

The configuration descriptor length of the Life-Cam 3000 as reported by the ESP32-S3

So it appears that the descriptors can differ depending on whether USB 1.1 Full Speed or USB 2.0 High Speed is used.

I’ve confirmed that by running the bare_api example from the tinyusb library on a Raspberry Pi Pico, which also supports being a USB 1.1 Full Speed host. The configuration descriptor reported by the webcam was 355 bytes long, just like on the ESP32-S3. Here are the contents of the configuration descriptor as reported by the dongle:

Debugging the USB enumeration


As I couldn’t directly compare the descriptor parsing code between the ESP32-S3 and my PC, I’ve chosen to sniff the USB communication between the ESP chip and the dongle. To do that, I’ve flashed pico_usb_sniffer onto my Raspberry Pi Pico and jury-rigged this setup:

The setup for sniffing the USB communication

I’ve tapped the D+ and D- pins of the ESP32-S3 and connected them to the Pico running the sniffer. Using the Python script included in the repository, I’ve saved a capture of the enumeration process. After that I’ve opened it in Wireshark to inspect it. This was the result:

The capture open in Wireshark

It turns out that Espressif’s code is not at fault here. Instead, it seems that the dongle simply has broken USB 1.1 support, even though in theory it could transmit a low-resolution, low-framerate video stream over USB 1.1. This prevents it from being used with the ESP32-S3.
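To put a rough number on "in theory": a Full Speed isochronous endpoint can carry at most 1023 bytes per 1 ms frame, which caps a single video stream at roughly 1 MB/s. The Python sketch below turns that into a per-image budget. It assumes one isochronous endpoint using the maximum packet size and ignores UVC payload headers and other bus traffic, so real numbers will be lower; the function names are made up.

```python
FS_ISO_MAX_PACKET = 1023      # bytes per isochronous transaction at Full Speed
FS_FRAMES_PER_SECOND = 1000   # one 1 ms USB frame per transaction


def max_bytes_per_image(fps: float, packet_size: int = FS_ISO_MAX_PACKET) -> float:
    """Upper bound on the size of one video image at the given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return packet_size * FS_FRAMES_PER_SECOND / fps


def max_fps(image_bytes: int, packet_size: int = FS_ISO_MAX_PACKET) -> float:
    """Upper bound on the frame rate for images of the given size."""
    if image_bytes <= 0:
        raise ValueError("image_bytes must be positive")
    return packet_size * FS_FRAMES_PER_SECOND / image_bytes


if __name__ == "__main__":
    # Budget per MJPEG image at the 10 fps I measured with the webcam.
    print(max_bytes_per_image(10))
    # Uncompressed 640x480 at 2 bytes per pixel (e.g. YUY2) for comparison.
    print(max_fps(640 * 480 * 2))
```

At 10 fps this leaves about 100 KB per image, which is plausible for a 640x480 MJPEG frame, while uncompressed 640x480 video would stay below 2 fps.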

What can be done?


After that I opened the dongle and took a look inside:

The PCB of the dongle. It has the marking "SFX_HDMI_VC_1.6" and a large heat-sinked chip.

As I didn’t want to remove the heat-sink from the chip, I’ve googled the marking on the board and discovered that it contains the MS2109 from MacroSilicon. In theory it should support USB 1.1, as even its website says it’s “Compatible with USB1.1 mode”.

What is promising is that the brochure mentions the use of an EEPROM to store some configuration (and possibly code for the built-in MCU). As seen in the previous photo, my dongle appears to have an AT24C08D, an 8 Kbit (1 KiB) EEPROM from Microchip.

Functional block diagram of the MS2109

I would imagine that the USB descriptors are one of the things that can be configured in the EEPROM, since they would be the first thing OEMs would want to change. I haven’t gotten around to de-soldering and reading the EEPROM yet, but I have found this repository: https://github.com/BertoldVdb/ms-tools

It contains a tool for talking to the MS2109 over a HID interface and even executing code on it to dump the ROM and EEPROM. Maybe it could be used to fix the USB descriptors on the dongle. I’ll update this post if I get to it.

Conclusion


Using the ESP32-S3 for video capture from HDMI is not possible (yet), so I can’t use it as a KVM solution :(