5.2.1 Voice Control

The voice control interface provides a complete set of voice interaction capabilities, including speech synthesis, speech recognition, audio noise reduction, audio playback, and volume control.

Key Features

Text-to-Speech (TTS)

  • Text-to-speech: Convert text into natural-sounding speech.

  • Multi-language support: Supports Chinese, English, and other languages.

  • Emotional speech: Supports different emotional styles for synthesis.

  • Priority management: Supports multi-level priority control.

Automatic Speech Recognition (ASR) (coming soon)

  • Real-time recognition: Supports real-time speech recognition.

  • Multi-language recognition: Supports Chinese, English, and other languages.

  • Audio stream processing: Supports real-time processing of audio streams.

Audio Processing

  • Real-time noise reduction: Supports real-time audio denoising.

  • Voice activity detection: Supports VAD (Voice Activity Detection).

  • Streaming: Supports streaming of denoised audio.

Audio Playback

  • Audio stream playback: Supports playback of audio data streams.

  • Priority control: Supports playback priority management.

  • Format support: Supports multiple audio formats.

Volume Control

  • Volume adjustment: Supports system volume adjustment.

  • Mute control: Supports mute / unmute.

  • Volume query: Supports querying the current volume.

Volume Control Services

| Service Name | Data Type | Description |
|---|---|---|
| /aimdk_5Fmsgs/srv/GetVolume | GetVolume | Query volume |
| /aimdk_5Fmsgs/srv/SetVolume | SetVolume | Set volume |
| /aimdk_5Fmsgs/srv/GetMute | GetMute | Query mute status |
| /aimdk_5Fmsgs/srv/SetMute | SetMute | Set mute |

  • GetVolume ros2-srv @ hal/audio/srv/GetVolume.srv

    # Get Volume
    # Service: /aimdk_5Fmsgs/srv/GetVolume
    
    # Request
    CommonRequest request            # Request header
    
    ---
    
    # Response
    CommonResponse reponse           # Response header
    uint32 audio_volume              # Current volume (0–100)
    
  • SetVolume ros2-srv @ hal/audio/srv/SetVolume.srv

    # Set Volume
    # Service: /aimdk_5Fmsgs/srv/SetVolume
    
    # Request
    CommonRequest request            # Request header
    uint32 audio_volume              # Target volume (0–100)
    
    ---
    
    # Response
    CommonResponse reponse           # Response header
    uint32 audio_volume              # Current volume (0–100)
    
  • GetMute ros2-srv @ hal/audio/srv/GetMute.srv

    # Get Mute Status
    # Service: /aimdk_5Fmsgs/srv/GetMute
    
    # Request
    CommonRequest request            # Request header
    
    ---
    
    # Response
    CommonResponse reponse           # Response header
    bool is_mute                     # Current mute state
    
  • SetMute ros2-srv @ hal/audio/srv/SetMute.srv

    # Set Mute
    # Service: /aimdk_5Fmsgs/srv/SetMute
    
    # Request
    CommonRequest request            # Request header
    bool is_mute                     # Target mute state
    
    ---
    
    # Response
    CommonResponse reponse           # Response header
    bool is_mute                     # Current mute state
    

Speech Synthesis Services

| Service Name | Data Type | Description |
|---|---|---|
| /aimdk_5Fmsgs/srv/PlayTts | PlayTts | Text-to-speech playback |

  • PlayTts ros2-srv @ interaction/srv/PlayTts.srv

    # TTS Playback
    # Service: /aimdk_5Fmsgs/srv/PlayTts
    
    # Request
    CommonRequest header
    PlayTtsRequest tts_req  # Embedded request msg
    
    ---
    
    # Response
    CommonResponse header
    PlayTtsResponse tts_resp  # Embedded response msg
    

    Where

    • PlayTtsRequest ros2-msg @ interaction/msg/PlayTtsRequest.msg

      # Embedded request msg
      
      string text                      # Text content
      TtsPriorityLevel priority_level  # Priority level (see TtsPriorityLevel below)
      uint32 priority_weight           # Priority weight (0–99)
      string domain                    # Caller domain
      string trace_id                  # Request trace ID
      bool is_interrupted              # Whether to interrupt broadcasts of the same priority (otherwise queued)
      
      • TtsPriorityLevel ros2-msg @ interaction/msg/TtsPriorityLevel.msg

        # TTS priority level
        uint8 value                      # Priority value
        

        Available TtsPriorityLevel values:

        | Level | Value | Description | Usage scenarios |
        |---|---|---|---|
        | Emergency safety layer (SAFETY_L10) | 10 | Highest priority | Safety alerts, emergency notifications |
        | Warning layer (WARNING_L8) | 8 | High priority | Hazard alerts and warning messages |
        | System notice layer (SYSTEM_L7) | 7 | Medium-high priority | System-level notices |
        | Interaction response layer (INTERACTION_L6) | 6 | Medium priority | User interaction and conversational responses |
        | Mission execution layer (MISSION_L4) | 4 | Medium-low priority | Task execution and status broadcasts |
        | Service layer (SERVICE_L2) | 2 | Low priority | Proactive services and reminders |
        | Background service layer (BACKGROUND_L1) | 1 | Lowest priority | Background services and logging |

        Audio playback priority mechanism:

        • This priority system applies to both TTS playback (PlayTts) and audio file playback (PlayAudioFile).

        • Higher priority playback interrupts lower priority playback.

        • For the same priority level, behavior is determined by priority_weight and is_interrupted.

        • The playback queue is reset when playback is interrupted.

        • The emergency safety level has the highest priority and cannot be interrupted by any other level.
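The level rules above can be sketched in a few lines of Python. This is a minimal illustration, not the robot's arbitration code: the names `TtsLevel` and `preempts` are hypothetical, and the same-level case is deliberately left undecided because its outcome depends on priority_weight and is_interrupted, whose exact interaction is not modelled here.

```python
from enum import IntEnum
from typing import Optional


class TtsLevel(IntEnum):
    """Values copied from the TtsPriorityLevel table (names are illustrative)."""
    BACKGROUND_L1 = 1
    SERVICE_L2 = 2
    MISSION_L4 = 4
    INTERACTION_L6 = 6
    SYSTEM_L7 = 7
    WARNING_L8 = 8
    SAFETY_L10 = 10


def preempts(new: TtsLevel, current: TtsLevel) -> Optional[bool]:
    """Return True if `new` interrupts `current`, False if it cannot.

    Returns None when both requests share a level: the result then depends
    on priority_weight and is_interrupted, which this sketch does not model.
    """
    if new == current:
        return None
    return new > current
```

Because SAFETY_L10 is the largest value, no other level can ever preempt it in this sketch, which matches the last rule in the list.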

    • PlayTtsResponse ros2-msg @ interaction/msg/PlayTtsResponse.msg

      # Embedded response msg
      string text                      # Response text
      TtsPriorityLevel priority_level  # Priority level
      uint32 priority_weight           # Priority weight
      string domain                    # Caller domain
      string trace_id                  # Request trace ID
      bool is_success                  # Whether the request succeeded
      string error_message             # Error message
      uint32 estimated_duration        # Estimated duration (ms)
      

Audio File Playback Service

Call the PlayAudioFile service with the audio file path (file_path = parent directory, file_name = filename) and priority to trigger playback. A response where reponse.status.value == 1 indicates success. See examples: C++ / Python.

| Service Name | Data Type | Description |
|---|---|---|
| /aimdk_5Fmsgs/srv/PlayAudioFile | PlayAudioFile | Play audio file |

  • PlayAudioFile ros2-srv @ hal/audio/srv/PlayAudioFile.srv

    # Play audio file
    # Service: /aimdk_5Fmsgs/srv/PlayAudioFile
    
    # Request
    CommonRequest request            # Request header
    AudioFile file                   # Audio file info (required)
    builtin_interfaces/Time play_stamps  # Optional; scheduled play time, default: play immediately
    
    ---
    
    # Response
    CommonResponse reponse           # Response header
    
    • AudioFile ros2-msg @ hal/audio/msg/AudioFile.msg

      string pkg_name        # Required; identifies the caller
      string file_name       # Required; file name
      string file_path       # Required; parent directory path (uses system default if empty; must not end with the file name)
      AudioInfo info         # Required for PCM, optional for WAV; audio format
      uint32 priority        # Required; priority (1–10, default 6)
      uint32 priority_weight # Optional; (1–100) final priority = priority + priority_weight%
      

    Notes:

    • Audio files must be PCM-encoded raw files (.pcm) or WAV files wrapping this PCM data (.wav). Other formats such as MP3 are not supported.

    • Audio must be 16 kHz sample rate, 16-bit, mono.

    • When using an absolute path, set file_path to the parent directory and file_name to the file name.

    • Audio files must be stored on the interaction compute unit (PC3, 10.0.1.42), not the development compute unit (PC2).

    • The audio folder and all its parent directories must be readable by all users (a subdirectory under /var/tmp/ is recommended).
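Before calling PlayAudioFile, it can help to verify a file against the format notes above and to split its absolute path into the two AudioFile fields. The following is a minimal sketch using only the Python standard library; the helper names (`split_audio_path`, `is_supported_wav`, `write_silence_wav`) are hypothetical and not part of the SDK, and the check covers WAV files only (raw .pcm files carry no header to inspect).

```python
import posixpath
import wave
from typing import Tuple


def split_audio_path(abs_path: str) -> Tuple[str, str]:
    """Split an absolute path on the robot into (file_path, file_name)."""
    file_path, file_name = posixpath.split(abs_path)
    return file_path, file_name


def is_supported_wav(path: str) -> bool:
    """Return True if the WAV file is uncompressed PCM, 16 kHz, 16-bit, mono."""
    try:
        with wave.open(path, "rb") as wav:
            return (wav.getframerate() == 16000
                    and wav.getsampwidth() == 2      # 2 bytes = 16 bit
                    and wav.getnchannels() == 1
                    and wav.getcomptype() == "NONE")
    except (wave.Error, EOFError, OSError):
        # Unreadable, truncated, or non-PCM files are treated as unsupported.
        return False


def write_silence_wav(path: str, seconds: float = 0.5) -> None:
    """Write a silent 16 kHz / 16-bit / mono WAV file (handy as a test fixture)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(16000)
        wav.writeframes(b"\x00\x00" * int(16000 * seconds))
```

For example, `split_audio_path("/var/tmp/audio/beep.wav")` yields `("/var/tmp/audio", "beep.wav")`, which map to file_path and file_name respectively.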

Audio Stream Playback

Supports playback of raw audio streams.

| Service Name | Data Type | Description |
|---|---|---|
| /aimdk_5Fmsgs/srv/RequestAudioFocus | RequestAudioFocus | Request audio playback focus |
| /aimdk_5Fmsgs/srv/AbandonAudioFocus | AbandonAudioFocus | Release audio playback focus |

| Topic Name | Data Type | Description | QoS | Frequency |
|---|---|---|---|---|
| /aima/hal/audio/playback | AudioPlayback | Audio stream playback | - | Published by the user application |
| /aima/hal/audio/focus_response | FocusResponse | Audio focus change events | - | Event-triggered; notifies when audio focus is preempted |
| /aima/hal/audio/play_state | PlayStateChange | Audio playback state events | - | Event-triggered; notifies on audio playback state change |

  • RequestAudioFocus ros2-srv @ hal/audio/srv/RequestAudioFocus.srv

    # Request audio playback focus
    # Service: /aimdk_5Fmsgs/srv/RequestAudioFocus
    
    # Request
    CommonRequest request # Request header
    
    FocusRequester focus_requester # Focus request info
    
    ---
    
    # Response
    CommonResponse reponse # Response header
    
    FocusResponse focus_response # Request result
    
    • FocusRequester ros2-msg @ hal/audio/msg/FocusRequester.msg

      string pkg_name # Playback source identifier
      
      uint32 priority # Priority (1–10, default 6)
      
      uint32 priority_weight # Weight (optional); breaks ties within same priority level
      
    • FocusResponse ros2-msg @ hal/audio/msg/FocusResponse.msg

      string pkg_name # Playback source identifier
      
      bool focus_gain # Focus grant result
      
  • AbandonAudioFocus ros2-srv @ hal/audio/srv/AbandonAudioFocus.srv

    # Release audio playback focus
    # Service: /aimdk_5Fmsgs/srv/AbandonAudioFocus
    
    # Request
    CommonRequest request # Request header
    
    FocusRequester focus_requester # Focus request info
    
    ---
    
    # Response
    CommonResponse reponse # Response header
    
    FocusResponse focus_response # Request result
    
    • FocusRequester and FocusResponse are defined above.

  • AudioPlayback ros2-msg @ hal/audio/msg/AudioPlayback.msg

    # Audio stream playback
    # Topic: /aima/hal/audio/playback
    
    builtin_interfaces/Time stamps # Timestamp
    
    AudioInfo info # Audio format
    
    AudioData data # Audio data
    
    string pkg_name # Playback source identifier
    
    string token_id # (Optional) changing token_id clears the playback buffer (used to interrupt current playback)
    
    • AudioInfo ros2-msg @ hal/audio/msg/AudioInfo.msg

      uint8 channels # Number of channels
      
      uint32 sample_rate # Sample rate [Hz], currently only 16000
      
      uint32 size # (not used) write size [byte]
      
      string sample_format # Audio format, currently only S16LE
      
      string coding_format # Audio coding format, currently only pcm
      
    • AudioData ros2-msg @ hal/audio/msg/AudioData.msg

      uint8[] data
      
  • FocusResponse ros2-msg @ hal/audio/msg/FocusResponse.msg

    # Audio focus change events
    # Topic: /aima/hal/audio/focus_response
    
    string pkg_name # Playback source identifier
    
    bool focus_gain # Focus grant result
    
  • PlayStateChange ros2-msg @ hal/audio/msg/PlayStateChange.msg

    # Audio playback state events
    # Topic: /aima/hal/audio/play_state
    
    string pkg_name # Playback source identifier
    
    PlayStateType state # Playback state
    
    • PlayStateType ros2-msg @ hal/audio/msg/PlayStateType.msg

      uint8 value   # Playback state (0: off, 1: playing, 2: stopped)
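
When streaming to /aima/hal/audio/playback, the PCM data has to be cut into pieces, each carried in the AudioData field of one AudioPlayback message. The sketch below shows the byte arithmetic for the only format currently listed in AudioInfo (16000 Hz, S16LE, pcm). The helper names and the 40 ms chunk length are assumptions chosen for illustration; the documentation does not prescribe a chunk size, and the dictionary merely mirrors the AudioInfo field names rather than being a real message object.

```python
from typing import Dict, Iterator

SAMPLE_RATE = 16000     # Hz, the only rate currently supported per AudioInfo
BYTES_PER_SAMPLE = 2    # S16LE: signed 16-bit little-endian
CHANNELS = 1            # mono


def audio_info() -> Dict[str, object]:
    """Field values for AudioInfo, keyed by the field names in AudioInfo.msg."""
    return {
        "channels": CHANNELS,
        "sample_rate": SAMPLE_RATE,
        "sample_format": "S16LE",
        "coding_format": "pcm",
    }


def pcm_chunks(pcm: bytes, chunk_ms: int = 40) -> Iterator[bytes]:
    """Yield consecutive chunks of `chunk_ms` milliseconds of audio.

    The last chunk may be shorter. At 16 kHz / 16-bit / mono, one millisecond
    is 32 bytes, so chunk boundaries always fall on whole samples.
    """
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    step = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS * chunk_ms // 1000
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]
```

Each yielded chunk would be published in order with the same pkg_name; per the AudioPlayback definition, changing token_id clears the playback buffer, so it should stay constant within one stream.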
      

MIC Audio Stream Capture Topic

Supports receiving real-time VAD (Voice Activity Detection) events on denoised audio and the corresponding audio stream, as well as raw audio stream capture.

| Topic Name | Data Type | Description | QoS | Frequency |
|---|---|---|---|---|
| /agent/process_audio_output | ProcessedAudioOutput | VAD audio capture | - | Event-triggered; audio cached for speech recognition is sent in a burst at the start of a VAD event, then updates arrive at ~25 Hz |
| /aima/hal/audio/capture | AudioCapture | Raw audio capture | - | |

  • ProcessedAudioOutput ros2-msg @ interaction/msg/ProcessedAudioOutput.msg

    MessageHeader header  # Message header
    
    uint32 stream_id  # Audio stream ID (1: onboard mic, 2: external mic; regardless of which mic is active, audio is always published with stream_id=1 and saved under the fixed stream_1/ subdirectory)
    AudioVadStateType audio_vad_state  # VAD state (0: no speech, 1: speech start, 2: in speech, 3: speech end)
    uint8[] audio_data  # Audio data (PCM, 16 kHz / 16 bit / 1 ch)
    

Audio stream format:

  • Sample rate: 16 kHz

  • Bit depth: 16 bit

  • Channels: mono

  • Encoding: PCM
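
A subscriber to /agent/process_audio_output typically wants whole utterances rather than individual messages. The following sketch groups messages by the VAD states listed in ProcessedAudioOutput (1: speech start, 2: in speech, 3: speech end). It works on plain `(vad_state, audio_data)` pairs so that it stays independent of any ROS client library; the function names are hypothetical, and it assumes that audio carried by the speech-end message, if any, belongs to the utterance.

```python
from typing import Iterable, Iterator, Optional, Tuple

VAD_NONE, VAD_START, VAD_ACTIVE, VAD_END = 0, 1, 2, 3


def utterances(messages: Iterable[Tuple[int, bytes]]) -> Iterator[bytes]:
    """Yield the concatenated PCM of each complete utterance."""
    buffer: Optional[bytearray] = None
    for state, data in messages:
        if state == VAD_START:
            # Start a new utterance; an unfinished previous one is dropped.
            buffer = bytearray(data)
        elif state == VAD_ACTIVE and buffer is not None:
            buffer.extend(data)
        elif state == VAD_END and buffer is not None:
            buffer.extend(data)
            yield bytes(buffer)
            buffer = None
        # Messages outside a start/end pair are ignored.


def duration_seconds(pcm: bytes) -> float:
    """Length of 16 kHz / 16-bit / mono PCM in seconds (32000 bytes per second)."""
    return len(pcm) / (16000 * 2)
```

The resulting byte strings can be written to a .pcm file or passed to a recognizer as-is, since they already match the stream format above.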

Attention

A wake word is required to activate VAD (since v0.9):

  • In default mode (built-in interaction enabled), always say the wake word before the target speech, because VAD stays active only briefly.

  • In only_voice mode (built-in interaction disabled), VAD stays active for a long time once woken by the wake word. No further wake words are needed; all speech detected afterwards is captured as VAD streams.

  • AudioCapture ros2-msg @ hal/audio/msg/AudioCapture.msg

    # Raw audio capture
    # Topic: /aima/hal/audio/capture
    
    builtin_interfaces/Time stamps
    
    uint8 mic_channels # Number of microphone channels
    
    uint8 ref_channels # Number of reference (echo-cancellation) channels
    
    AudioInfo info # Audio format
    
    AudioData data # Audio data
    
    string pkg_name # Audio source
    

    AudioInfo and AudioData are defined above (see Audio Stream Playback).

Microphone Control Services

| Service Name | Data Type | Description |
|---|---|---|
| /aimdk_5Fmsgs/srv/GetMicSourceRequest | GetMicSourceRequest | Query the current MIC device |
| /aimdk_5Fmsgs/srv/SetMicSourceRequest | SetMicSourceRequest | Switch the MIC device |

  • GetMicSourceRequest ros2-srv @ interaction/srv/GetMicSourceRequest.srv

    # Query current MIC device
    # Service: /aimdk_5Fmsgs/srv/GetMicSourceRequest
    
    # Request
    CommonRequest header
    
    ---
    
    # Response
    CommonResponse header
    uint32 mic_source  # 0: built-in mic, 1: external mic
    
  • SetMicSourceRequest ros2-srv @ interaction/srv/SetMicSourceRequest.srv

    # Switch MIC device
    # Service: /aimdk_5Fmsgs/srv/SetMicSourceRequest
    
    # Request
    CommonRequest header
    uint32 mic_source  # 0: built-in mic, 1: external mic
    
    ---
    
    # Response
    CommonResponse header
    

Programming Examples

For detailed programming examples and code descriptions, see:

Safety Notes

Warning

Voice playback limitations

  • The TTS service uses a priority system; avoid starting multiple speech playbacks at the same time.

  • Higher-priority speech will interrupt lower-priority speech; configure priorities carefully.

  • Check the current playback state before starting new speech.

Caution

Standard ROS does not handle cross-host service calls (request-response) well. Refer to the SDK examples to use the open interfaces robustly, with protection mechanisms such as exception safety and retransmission.
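
As a rough illustration of the retransmission idea, the sketch below wraps an arbitrary call in a bounded retry loop. It is not taken from the SDK examples: `call_with_retry` and its parameters are hypothetical, and `call` stands for whatever function sends one service request and returns the response, or None on timeout.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retry(call: Callable[[], Optional[T]],
                    attempts: int = 3,
                    delay_s: float = 0.5) -> Optional[T]:
    """Invoke `call` up to `attempts` times and return its first non-None result.

    A None result or a raised exception counts as a failed attempt.
    Returns None if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = call()
            if result is not None:
                return result
        except Exception as exc:  # keep the caller alive on any failure
            print(f"attempt {attempt} failed: {exc}")
        if attempt < attempts:
            time.sleep(delay_s)
    return None
```

Retrying is safest for queries such as GetVolume. For requests with side effects, such as PlayTts, a retry after a lost response may trigger the action twice, so the number of attempts should be chosen with that in mind.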

Note

Best Practices

  • Choose appropriate priority levels to avoid interfering with important announcements.

  • Implement monitoring and exception handling for speech playback.

  • Implement a playback queue for speech management.

  • Pay attention to the required audio format and sample rate.

  • Make the receive queue (QoS depth) of the VAD topic subscription large enough to absorb the burst sent at the start of a VAD event.

  • Always say the wake word first when using VAD.
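
One way to follow the playback-queue recommendation is a small client-side priority queue that releases the most important pending announcement first. The sketch below is an assumption-laden illustration: `SpeechQueue` is a hypothetical helper, higher priority_weight is assumed to win within a level, and equal entries are served in arrival order.

```python
import heapq
import itertools
from typing import List, Optional, Tuple


class SpeechQueue:
    """Pending announcements ordered by level, then weight, then arrival."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, int, str]] = []
        self._counter = itertools.count()  # arrival order for tie-breaking

    def push(self, text: str, level: int, weight: int = 0) -> None:
        # heapq is a min-heap, so level and weight are negated to pop the largest first.
        heapq.heappush(self._heap, (-level, -weight, next(self._counter), text))

    def pop(self) -> Optional[str]:
        """Return the next text to speak, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[3]

    def __len__(self) -> int:
        return len(self._heap)
```

An application would call `pop()` after the previous playback has finished and pass the text to PlayTts with the matching priority fields.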