5.2.1 Voice Control

The voice control interface provides a complete set of voice interaction capabilities, including speech synthesis, speech recognition, audio noise reduction, audio playback, and volume control.

Key Features

Text-to-Speech (TTS)

  • Text-to-speech: Converts text into natural-sounding speech.

  • Multi-language support: Supports Chinese, English, and other languages.

  • Emotional speech: Supports different emotional styles for synthesis.

  • Priority management: Supports multi-level priority control.

Automatic Speech Recognition (ASR) (coming soon)

  • Real-time recognition: Supports real-time speech recognition.

  • Multi-language recognition: Supports Chinese, English, and other languages.

  • Audio stream processing: Supports real-time processing of audio streams.

Audio Processing

  • Real-time noise reduction: Supports real-time audio denoising.

  • Voice activity detection: Supports VAD (Voice Activity Detection).

  • Streaming: Supports streaming of denoised audio.

Audio Playback

  • Audio stream playback: Supports playback of audio data streams.

  • Priority control: Supports playback priority management.

  • Format support: Supports multiple audio formats.

Volume Control

  • Volume adjustment: Supports system volume adjustment.

  • Mute control: Supports mute / unmute.

  • Volume query: Supports querying the current volume.

Volume Control Services

Service Name                  Data Type   Description
/aimdk_5Fmsgs/srv/GetVolume   GetVolume   Query volume
/aimdk_5Fmsgs/srv/SetVolume   SetVolume   Set volume
/aimdk_5Fmsgs/srv/GetMute     GetMute     Query mute status
/aimdk_5Fmsgs/srv/SetMute     SetMute     Set mute

  • GetVolume ros2-srv @ /hal/audio/srv/GetVolume.srv

    # Get Volume
    # Service: /aimdk_5Fmsgs/srv/GetVolume
    
    # Request
    CommonRequest request            # Request header
    
    ---
    
    # Response
    CommonResponse reponse           # Response header
    uint32 audio_volume              # Current volume (0–100)
    
  • SetVolume ros2-srv @ /hal/audio/srv/SetVolume.srv

    # Set Volume
    # Service: /aimdk_5Fmsgs/srv/SetVolume
    
    # Request
    CommonRequest request            # Request header
    uint32 audio_volume              # Target volume (0–100)
    
    ---
    
    # Response
    CommonResponse reponse           # Response header
    uint32 audio_volume              # Current volume (0–100)
    
  • GetMute ros2-srv @ /hal/audio/srv/GetMute.srv

    # Get Mute Status
    # Service: /aimdk_5Fmsgs/srv/GetMute
    
    # Request
    CommonRequest request            # Request header
    
    ---
    
    # Response
    CommonResponse reponse           # Response header
    bool is_mute                     # Current mute state
    
  • SetMute ros2-srv @ /hal/audio/srv/SetMute.srv

    # Set Mute
    # Service: /aimdk_5Fmsgs/srv/SetMute
    
    # Request
    CommonRequest request            # Request header
    bool is_mute                     # Target mute state
    
    ---
    
    # Response
    CommonResponse reponse           # Response header
    bool is_mute                     # Current mute state
    

Speech Synthesis Services

Service Name                Data Type   Description
/aimdk_5Fmsgs/srv/PlayTts   PlayTts     Text-to-speech playback

  • PlayTts ros2-srv @ interaction/srv/PlayTts.srv

    # TTS Playback
    # Service: /aimdk_5Fmsgs/srv/PlayTts
    
    # Request
    CommonRequest header
    PlayTtsRequest tts_req  # Embedded request msg
    
    ---
    
    # Response
    CommonResponse header
    PlayTtsResponse tts_resp  # Embedded response msg
    

    Where

    • PlayTtsRequest ros2-msg @ interaction/msg/PlayTtsRequest.msg

      # Embedded request msg
      
      string text                      # Text content
      TtsPriorityLevel priority_level  # Priority level (see TtsPriorityLevel below)
      uint32 priority_weight           # Priority weight (0–99)
      string domain                    # Caller domain
      string trace_id                  # Request trace ID
      bool is_interrupted              # Whether to interrupt broadcasts of the same priority (otherwise queued)
      
      • TtsPriorityLevel ros2-msg @ interaction/msg/TtsPriorityLevel.msg

        # TTS priority level
        uint8 value                      # Priority value
        

        Available TtsPriorityLevel values:

        Level                                        Value  Description           Usage scenarios
        Emergency safety layer (SAFETY_L10)          10     Highest priority      Safety alerts, emergency notifications
        Warning layer (WARNING_L8)                   8      High priority         Hazard alerts and warning messages
        System notice layer (SYSTEM_L7)              7      Medium-high priority  System-level notices
        Interaction response layer (INTERACTION_L6)  6      Medium priority       User interaction and conversational responses
        Mission execution layer (MISSION_L4)         4      Medium-low priority   Task execution and status broadcasts
        Service layer (SERVICE_L2)                   2      Low priority          Proactive services and reminders
        Background service layer (BACKGROUND_L1)     1      Lowest priority       Background services and logging

        Audio playback priority mechanism:

        • This priority system applies to both TTS playback (PlayTts) and audio file playback (PlayMediaFile).

        • Higher priority playback interrupts lower priority playback.

        • For the same priority level, behavior is determined by priority_weight and is_interrupted.

        • The playback queue is reset when playback is interrupted.

        • The emergency safety level has the highest priority and cannot be interrupted by any other level.
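The rules above can be sketched as a small decision function. This is only an illustration of the documented behavior, not the robot's actual scheduler: the names `TtsPriority`, `Playback`, and `arbitrate` are hypothetical, the effect of priority_weight is not modeled because its exact semantics are not specified here, and queuing a lower-priority request behind a higher-priority one is an assumption.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class TtsPriority(IntEnum):
    """Values copied from the TtsPriorityLevel table."""
    SAFETY_L10 = 10
    WARNING_L8 = 8
    SYSTEM_L7 = 7
    INTERACTION_L6 = 6
    MISSION_L4 = 4
    SERVICE_L2 = 2
    BACKGROUND_L1 = 1


@dataclass
class Playback:
    """Hypothetical stand-in for the priority fields of a playback request."""
    level: TtsPriority
    is_interrupted: bool = False  # mirrors the request field of the same name


def arbitrate(current: Optional[Playback], incoming: Playback) -> str:
    """Return 'play', 'interrupt', or 'queue' for an incoming request.

    Models only the documented rules; priority_weight is deliberately ignored.
    """
    if current is None:
        return "play"  # nothing is playing
    if incoming.level > current.level:
        return "interrupt"  # higher priority interrupts lower priority
    if incoming.level == current.level and incoming.is_interrupted:
        return "interrupt"  # same level, caller asked to interrupt
    # Same level without is_interrupted is queued (documented). Treating a
    # lower-priority request the same way is an assumption of this sketch.
    return "queue"
```

For example, `arbitrate(Playback(TtsPriority.SAFETY_L10), Playback(TtsPriority.WARNING_L8, True))` returns `"queue"`, because no other level can interrupt the emergency safety level.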

    • PlayTtsResponse ros2-msg @ interaction/msg/PlayTtsResponse.msg

      # Embedded response msg
      string text                      # Response text
      TtsPriorityLevel priority_level  # Priority level
      uint32 priority_weight           # Priority weight
      string domain                    # Caller domain
      string trace_id                  # Request trace ID
      bool is_success                  # Whether the request succeeded
      string error_message             # Error message
      uint32 estimated_duration        # Estimated duration (ms)
      

Audio File Playback Service

Service Name                      Data Type       Description
/aimdk_5Fmsgs/srv/PlayMediaFile   PlayMediaFile   Play audio file

  • PlayMediaFile ros2-srv @ interaction/srv/PlayMediaFile.srv

    # Play audio file
    # Service: /aimdk_5Fmsgs/srv/PlayMediaFile
    
    # Request
    CommonRequest header
    PlayMediaFileRequest media_file_req
    
    ---
    
    # Response
    CommonResponse header
    PlayTtsResponse tts_resp  # Reuses PlayTtsResponse
    
    • PlayMediaFileRequest ros2-msg @ interaction/msg/PlayMediaFileRequest.msg

      # Embedded request msg
      
      string file_name  # Absolute path to the audio file (must be on the interaction compute unit and readable by all)
      uint32 sample_rate  # Currently unused; default is 16 kHz, 1 channel
      TtsPriorityLevel priority_level  # Recommended default: INTERACTION_L6
      uint32 priority_weight  # Weight (0–99)
      string domain  # Caller domain
      string trace_id  # Request trace ID
      bool is_interrupted  # Whether to interrupt broadcasts of the same priority (otherwise queued)
      

      For priority_level values, see the audio priority table.

    • PlayTtsResponse as described above.

    Notes:

    • Audio files must be PCM-encoded raw files (.pcm) or WAV files wrapping this PCM data (.wav).

    • Audio must be 16 kHz sample rate, 16-bit, mono.

    • Audio and video files must use absolute paths.

    • Audio and video files must be stored on the interaction compute unit (PC3, 10.0.1.42), not the development compute unit (PC2).

    • Audio and video files (and all parent directories up to root) must be readable by all users (a new subdirectory under /var/tmp/ is recommended).
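A file can be checked against the format notes above before it is copied to the robot. The following is a minimal sketch using only the Python standard library; `check_media_file` is a hypothetical helper, not part of the SDK. It verifies the absolute path and the WAV header fields only. Raw .pcm files have no header, so their format cannot be verified this way, and file permissions and the host the file resides on are not checked.

```python
import os
import wave


def check_media_file(path):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    if not os.path.isabs(path):
        problems.append("path is not absolute")
    if path.lower().endswith(".pcm"):
        # Raw PCM carries no header, so rate/width/channels cannot be inspected.
        return problems
    try:
        with wave.open(path, "rb") as wav:
            if wav.getframerate() != 16000:
                problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000")
            if wav.getsampwidth() != 2:
                problems.append(f"sample width is {8 * wav.getsampwidth()} bit, expected 16")
            if wav.getnchannels() != 1:
                problems.append(f"{wav.getnchannels()} channels, expected mono")
    except (wave.Error, OSError, EOFError) as exc:
        # wave.Error also covers WAV files that do not contain plain PCM.
        problems.append(f"not a readable PCM WAV file: {exc}")
    return problems
```

Running the check on the development machine only confirms the format; the file must still be placed on the interaction compute unit with the permissions described above.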

MIC Audio Stream Capture Topic

This topic publishes VAD (Voice Activity Detection) events detected on denoised audio, together with the corresponding audio stream.

Topic Name                    Data Type              Description         QoS  Frequency
/agent/process_audio_output   ProcessedAudioOutput   VAD audio capture   -    Event-triggered: audio cached for speech recognition is sent in a burst at the start of a VAD event, then updates arrive at ~25 Hz

  • ProcessedAudioOutput ros2-msg @ interaction/msg/ProcessedAudioOutput.msg

    MessageHeader header  # Message header
    
    uint32 stream_id  # Audio stream ID (1: onboard mic, 2: external mic)
    AudioVadStateType audio_vad_state  # VAD state (0: no speech, 1: speech start, 2: in speech, 3: speech end)
    uint8[] audio_data  # Audio data (PCM, 16 kHz / 16 bit / 1 ch)
    

Audio stream format:

  • Sample rate: 16 kHz

  • Bit depth: 16 bit

  • Channels: mono

  • Encoding: PCM
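A subscriber typically needs to reassemble one utterance from the chunks delivered between "speech start" and "speech end". The sketch below shows that bookkeeping in plain Python, independent of ROS. `UtteranceCollector` and `pcm_duration_s` are hypothetical names, the VAD state is passed as a plain integer (how it is read from AudioVadStateType is not shown here), and whether the "speech end" message itself carries audio is not stated above, so any data it contains is simply appended.

```python
SAMPLE_RATE_HZ = 16000   # from the audio stream format above
BYTES_PER_SAMPLE = 2     # 16 bit, mono

# VAD state values as listed in the audio_vad_state field comment
VAD_NONE, VAD_START, VAD_ACTIVE, VAD_END = 0, 1, 2, 3


class UtteranceCollector:
    """Accumulates PCM chunks per stream between speech start and speech end."""

    def __init__(self):
        self._buffers = {}  # stream_id -> bytearray of PCM data

    def feed(self, stream_id, vad_state, audio_data):
        """Return the complete utterance as bytes on speech end, else None."""
        if vad_state == VAD_START:
            # A new utterance begins; discard any unfinished one on this stream.
            self._buffers[stream_id] = bytearray(audio_data)
        elif vad_state == VAD_ACTIVE and stream_id in self._buffers:
            self._buffers[stream_id].extend(audio_data)
        elif vad_state == VAD_END and stream_id in self._buffers:
            buf = self._buffers.pop(stream_id)
            buf.extend(audio_data)
            return bytes(buf)
        # Chunks that arrive without a preceding speech start are ignored.
        return None


def pcm_duration_s(pcm):
    """Duration in seconds of a 16 kHz / 16 bit / mono PCM buffer."""
    return len(pcm) / BYTES_PER_SAMPLE / SAMPLE_RATE_HZ
```

In a subscriber callback, each message's stream_id, VAD state, and audio_data would be passed to `feed`, and a non-None return value handed to the recognizer.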

Attention

A wake word is required to activate VAD (since v0.9):

  • In default mode (built-in interaction ON), always say the wake word before the target speech, because VAD stays active only for a short time.

  • In only_voice mode (built-in interaction disabled), VAD stays active for an extended period once woken by the wake word. No further wake words are needed; all speech detected afterwards is captured as VAD streams.

Programming Examples

For detailed programming examples and code descriptions, see:

Safety Notes

Warning

Voice playback limitations

  • The TTS service uses a priority system; avoid starting multiple speech playbacks at the same time.

  • Higher-priority speech will interrupt lower-priority speech; configure priorities carefully.

  • Check the current playback state before starting new speech.

Caution

Standard ROS does not handle cross-host service (request-response) calls well. Refer to the SDK examples to use the open interfaces robustly, with protection mechanisms such as exception safety and retransmission.
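The idea of retransmission with exception safety can be illustrated with a generic wrapper. This is a sketch under stated assumptions, not the SDK's mechanism: `call_with_retry` is a hypothetical helper, and `send_request` stands for any caller-supplied function that performs one service call and returns the response, or None if no response arrived in time.

```python
import time


def call_with_retry(send_request, attempts=3, delay_s=0.5, sleep=time.sleep):
    """Call send_request() until it returns a response or attempts run out.

    send_request: callable performing one request; returns None on timeout.
    Raises TimeoutError if every attempt fails.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            response = send_request()
            if response is not None:
                return response
        except Exception as exc:  # keep the loop alive on transport errors
            last_error = exc
        if attempt < attempts:
            sleep(delay_s)  # brief pause before retransmitting
    raise TimeoutError(f"no response after {attempts} attempts") from last_error
```

Retransmitting is harmless for queries such as GetVolume, but a request that starts playback may be executed twice if only the response was lost, so the retry policy should be chosen per service.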

Note

Best Practices

  • Choose appropriate priority levels to avoid interfering with important announcements.

  • Implement monitoring and exception handling for speech playback.

  • Implement a playback queue for speech management.

  • Pay attention to the required audio format and sample rate.

  • Make the receive queue (QoS depth) of the VAD subscription large enough to hold the burst sent at the start of a VAD event.

  • Always use the wake word when using VAD.