5.2.1 Voice Control

The voice control interface provides a complete set of voice interaction capabilities, including speech synthesis, speech recognition, audio noise reduction, audio playback, and volume control.

Key Features

Text-to-Speech (TTS)

  • Text-to-speech: Converts text into natural-sounding speech.

  • Multi-language support: Supports Chinese, English, and other languages.

  • Emotional speech: Supports different emotional styles for synthesis.

  • Priority management: Supports multi-level priority control.

Automatic Speech Recognition (ASR) (coming soon)

  • Real-time recognition: Supports real-time speech recognition.

  • Multi-language recognition: Supports Chinese, English, and other languages.

  • Audio stream processing: Supports real-time processing of audio streams.

Audio Processing

  • Real-time noise reduction: Supports real-time audio denoising.

  • Voice activity detection: Supports VAD (Voice Activity Detection).

  • Streaming: Supports streaming of denoised audio.

Audio Playback

  • Audio stream playback: Supports playback of audio data streams.

  • Priority control: Supports playback priority management.

  • Format support: Supports multiple audio formats.

Volume Control

  • Volume adjustment: Supports system volume adjustment.

  • Mute control: Supports mute / unmute.

  • Volume query: Supports querying the current volume.

Volume Control Services

Service Name                 Data Type  Description
---------------------------  ---------  -----------------
/aimdk_5Fmsgs/srv/GetVolume  GetVolume  Query volume
/aimdk_5Fmsgs/srv/SetVolume  SetVolume  Set volume
/aimdk_5Fmsgs/srv/GetMute    GetMute    Query mute status
/aimdk_5Fmsgs/srv/SetMute    SetMute    Set mute

  • GetVolume ros2-srv @ /hal/audio/srv/GetVolume.srv

    # Get Volume
    # Service: /aimdk_5Fmsgs/srv/GetVolume
    
    # Request
    CommonRequest request            # Request header
    
    ---
    
    # Response
    CommonResponse response          # Response header
    uint32 audio_volume              # Current volume (0–100)
    
  • SetVolume ros2-srv @ /hal/audio/srv/SetVolume.srv

    # Set Volume
    # Service: /aimdk_5Fmsgs/srv/SetVolume
    
    # Request
    CommonRequest request            # Request header
    uint32 audio_volume              # Target volume (0–100)
    
    ---
    
    # Response
    CommonResponse response          # Response header
    uint32 audio_volume              # Current volume (0–100)
    
  • GetMute ros2-srv @ /hal/audio/srv/GetMute.srv

    # Get Mute Status
    # Service: /aimdk_5Fmsgs/srv/GetMute
    
    # Request
    CommonRequest request            # Request header
    
    ---
    
    # Response
    CommonResponse response          # Response header
    bool is_mute                     # Current mute state
    
  • SetMute ros2-srv @ /hal/audio/srv/SetMute.srv

    # Set Mute
    # Service: /aimdk_5Fmsgs/srv/SetMute
    
    # Request
    CommonRequest request            # Request header
    bool is_mute                     # Target mute state
    
    ---
    
    # Response
    CommonResponse response          # Response header
    bool is_mute                     # Current mute state
    

Speech Synthesis Services

Service Name               Data Type  Description
-------------------------  ---------  -----------------------
/aimdk_5Fmsgs/srv/PlayTts  PlayTts    Text-to-speech playback

  • PlayTts ros2-srv @ interaction/srv/PlayTts.srv

    # TTS Playback
    # Service: /aimdk_5Fmsgs/srv/PlayTts
    
    # Request
    CommonRequest header
    PlayTtsRequest tts_req  # Embedded request msg
    
    ---
    
    # Response
    CommonResponse header
    PlayTtsResponse tts_resp  # Embedded response msg
    

    Where

    • PlayTtsRequest ros2-msg @ interaction/msg/PlayTtsRequest.msg

      # Embedded request msg
      
      string text                      # Text content
      TtsPriorityLevel priority_level  # Priority level (see TtsPriorityLevel below)
      uint32 priority_weight           # Priority weight (0–99)
      string domain                    # Caller domain
      string trace_id                  # Request trace ID
      bool is_interrupted              # Whether to interrupt broadcasts of the same priority (otherwise queued)
      
      • TtsPriorityLevel ros2-msg @ interaction/msg/TtsPriorityLevel.msg

        # TTS priority level
        uint8 value                      # Priority value
        

        Available TtsPriorityLevel values:

        Level                                        Value  Description           Usage scenarios
        -------------------------------------------  -----  --------------------  ---------------------------------------------
        Emergency safety layer (SAFETY_L10)          10     Highest priority      Safety alerts, emergency notifications
        Warning layer (WARNING_L8)                   8      High priority         Hazard alerts and warning messages
        Interaction response layer (INTERACTION_L6)  6      Medium-high priority  User interaction and conversational responses
        Mission execution layer (MISSION_L4)         4      Medium priority       Task execution and status broadcasts
        Service layer (SERVICE_L2)                   2      Low priority          Proactive services and reminders
        Background service layer (BACKGROUND_L1)     1      Lowest priority       Background services and logging

        Audio playback priority mechanism:

        • This priority system applies to both TTS playback (PlayTts) and audio file playback (PlayMediaFile).

        • Higher priority playback interrupts lower priority playback.

        • For the same priority level, behavior is determined by priority_weight and is_interrupted.

        • The emergency safety level has the highest priority and cannot be interrupted by any other level.
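The rules above can be sketched as a small decision function. This is an illustrative model only, not the robot's actual arbiter: the names `Level` and `decide` are hypothetical, the "wait" outcome for a lower-priority request is an assumption (the text does not say whether such requests are queued or rejected), and `priority_weight` is left out because its exact effect is not specified in this section.

```python
from enum import IntEnum


class Level(IntEnum):
    """Mirrors the TtsPriorityLevel values from the table above."""
    BACKGROUND_L1 = 1
    SERVICE_L2 = 2
    MISSION_L4 = 4
    INTERACTION_L6 = 6
    WARNING_L8 = 8
    SAFETY_L10 = 10


def decide(current, incoming, is_interrupted):
    """Return what happens to an incoming request.

    current: Level of the playback in progress, or None if nothing is playing.
    incoming: Level of the new request.
    is_interrupted: the new request's is_interrupted flag.
    """
    if current is None:
        return "play"
    if incoming > current:
        return "interrupt"  # higher priority interrupts lower priority
    if incoming == current:
        # Same level: interrupt if requested, otherwise queue.
        return "interrupt" if is_interrupted else "queue"
    return "wait"  # assumption: a lower-priority request never preempts


print(decide(Level.MISSION_L4, Level.WARNING_L8, False))          # interrupt
print(decide(Level.INTERACTION_L6, Level.INTERACTION_L6, False))  # queue
print(decide(Level.SAFETY_L10, Level.WARNING_L8, True))           # wait
```

Because SAFETY_L10 is the largest value, no request from another level can satisfy `incoming > current` against it, which matches the last rule in the list.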

    • PlayTtsResponse ros2-msg @ interaction/msg/PlayTtsResponse.msg

      # Embedded response msg
      string text                      # Response text
      TtsPriorityLevel priority_level  # Priority level
      uint32 priority_weight           # Priority weight
      string domain                    # Caller domain
      string trace_id                  # Request trace ID
      bool is_success                  # Whether the request succeeded
      string error_message             # Error message
      uint32 estimated_duration        # Estimated duration (ms)
      

Audio File Playback Service

Service Name                     Data Type      Description
-------------------------------  -------------  ---------------
/aimdk_5Fmsgs/srv/PlayMediaFile  PlayMediaFile  Play audio file

  • PlayMediaFile ros2-srv @ interaction/srv/PlayMediaFile.srv

    # Play audio file
    # Service: /aimdk_5Fmsgs/srv/PlayMediaFile
    
    # Request
    CommonRequest header
    PlayMediaFileRequest media_file_req
    
    ---
    
    # Response
    CommonResponse header
    PlayTtsResponse tts_resp  # Reuses PlayTtsResponse
    
    • PlayMediaFileRequest ros2-msg @ interaction/msg/PlayMediaFileRequest.msg

      # Embedded request msg
      
      string file_name                 # Absolute path to the audio file (must be on the interaction compute unit and readable by all users)
      uint32 sample_rate               # Currently unused (defaults to 16 kHz, 1 channel)
      TtsPriorityLevel priority_level  # Recommended default: INTERACTION_L6
      uint32 priority_weight           # Weight (0–99)
      string domain                    # Caller domain
      string trace_id                  # Request trace ID
      bool is_interrupted              # Whether to interrupt broadcasts of the same priority (otherwise queued)
      

      For priority_level values, see the audio priority table.

    • PlayTtsResponse as described above.

    Notes:

    • Audio files must be PCM-encoded raw files (.pcm) or WAV files wrapping this PCM data (.wav).

    • Audio must be 16 kHz sample rate, 16-bit, mono.

    • Audio and video files must use absolute paths.

    • Audio and video files must be stored on the interaction compute unit (PC3, 10.0.1.42), not the development compute unit (PC2).

    • Audio and video files (and all parent directories up to the root) must be readable by all users (a new subdirectory under /var/tmp/ is recommended).
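The format and permission requirements in these notes can be checked before calling PlayMediaFile. The sketch below uses only the Python standard library and is meant to run on the machine that stores the file. The function names are hypothetical, and the permission test reflects one reading of "readable by all users": the file needs the read bit for "other", and each parent directory needs the read and execute bits for "other". Raw .pcm files have no header, so only .wav files can be checked this way.

```python
import os
import stat
import wave


def wav_format_problems(path):
    """Return a list of mismatches against 16 kHz / 16-bit / mono."""
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != 16000:
            problems.append(f"sample rate is {wf.getframerate()} Hz, expected 16000")
        if wf.getsampwidth() != 2:
            problems.append(f"sample width is {wf.getsampwidth()} bytes, expected 2")
        if wf.getnchannels() != 1:
            problems.append(f"{wf.getnchannels()} channels, expected 1")
    return problems


def not_world_accessible(path):
    """Return the file and/or parent directories that 'other' users cannot access."""
    if not os.path.isabs(path):
        raise ValueError("path must be absolute")
    blocked = []
    if not os.stat(path).st_mode & stat.S_IROTH:
        blocked.append(path)
    needed = stat.S_IROTH | stat.S_IXOTH  # assumption: read + traverse
    parent = os.path.dirname(path)
    while True:
        if (os.stat(parent).st_mode & needed) != needed:
            blocked.append(parent)
        if parent == os.path.dirname(parent):  # reached the root
            break
        parent = os.path.dirname(parent)
    return blocked
```

Both functions return an empty list when nothing is wrong, so a caller can send the request only when `wav_format_problems(p) == []` and `not_world_accessible(p) == []`.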

MIC Audio Stream Capture Topic

This topic delivers VAD (Voice Activity Detection) events detected on the denoised audio, together with the corresponding audio stream.

Topic Name                   Data Type             Description        QoS  Frequency
---------------------------  --------------------  -----------------  ---  ---------
/agent/process_audio_output  ProcessedAudioOutput  VAD audio capture  -    Event-triggered: audio cached for speech recognition is sent in a burst at the start of a VAD event, then updates arrive at ~25 Hz

  • ProcessedAudioOutput ros2-msg @ interaction/msg/ProcessedAudioOutput.msg

    MessageHeader header  # Message header
    
    uint32 stream_id  # Audio stream ID (1: onboard mic, 2: external mic)
    AudioVadStateType audio_vad_state  # VAD state (0: no speech, 1: speech start, 2: in speech, 3: speech end)
    uint8[] audio_data  # Audio data (PCM, 16 kHz / 16 bit / 1 ch)
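
A subscriber typically needs to stitch the chunks of one utterance back together using the VAD state. The following is a minimal sketch of that bookkeeping in plain Python, independent of any ROS client library. `SegmentAssembler` is a hypothetical helper; it assumes the caller passes the VAD state as a plain integer (how that integer is read from AudioVadStateType is not shown in this section) and it appends whatever audio the "speech start" and "speech end" messages carry.

```python
VAD_NONE, VAD_START, VAD_ACTIVE, VAD_END = 0, 1, 2, 3


class SegmentAssembler:
    """Collects audio chunks per stream between 'speech start' and 'speech end'."""

    def __init__(self):
        self._buffers = {}  # stream_id -> bytearray of the utterance in progress

    def feed(self, stream_id, vad_state, audio_data):
        """Process one message; return the utterance as bytes on speech end, else None."""
        if vad_state == VAD_START:
            self._buffers[stream_id] = bytearray(audio_data)
        elif vad_state == VAD_ACTIVE:
            # Start a buffer on the fly if the start message was missed.
            self._buffers.setdefault(stream_id, bytearray()).extend(audio_data)
        elif vad_state == VAD_END:
            buf = self._buffers.pop(stream_id, bytearray())
            buf.extend(audio_data)
            return bytes(buf)
        return None


asm = SegmentAssembler()
asm.feed(1, VAD_START, b"\x01\x02")
asm.feed(1, VAD_ACTIVE, b"\x03\x04")
print(asm.feed(1, VAD_END, b""))  # b'\x01\x02\x03\x04'
```

Keeping one buffer per `stream_id` lets the onboard and external microphones be assembled independently.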
    

Audio stream format:

  • Sample rate: 16 kHz

  • Bit depth: 16 bit

  • Channels: mono

  • Encoding: PCM
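Given this format, the raw bytes can be turned into samples with the standard library alone. The sketch below assumes little-endian sample order, which this section does not state, and the function names are hypothetical.

```python
import struct

SAMPLE_RATE_HZ = 16000
BYTES_PER_SAMPLE = 2  # 16-bit mono


def decode_pcm16(audio_data):
    """Convert raw bytes to a tuple of signed 16-bit samples (little-endian assumed)."""
    data = bytes(audio_data)
    if len(data) % BYTES_PER_SAMPLE:
        raise ValueError("byte count is not a multiple of 2")
    count = len(data) // BYTES_PER_SAMPLE
    return struct.unpack(f"<{count}h", data)


def duration_ms(audio_data):
    """Playback time of a chunk in milliseconds."""
    return len(audio_data) / BYTES_PER_SAMPLE * 1000 / SAMPLE_RATE_HZ


print(decode_pcm16(b"\x00\x00\xff\x7f\x00\x80"))  # (0, 32767, -32768)
print(duration_ms(bytes(640)))                    # 20.0
```

At 16 kHz and 2 bytes per sample, one second of audio occupies 32000 bytes.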

Programming Examples

For detailed programming examples and code descriptions, see:

Safety Notes

Warning

Voice playback limitations

  • The TTS service uses a priority system; avoid starting multiple speech playbacks at the same time.

  • Higher-priority speech will interrupt lower-priority speech; configure priorities carefully.

  • Check the current playback state before starting new speech.

Caution

Standard ROS does not handle cross-host service calls (request-response) well. Refer to the SDK examples to use the open interfaces robustly, with protection mechanisms such as exception safety and retransmission.
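As a rough illustration of what such protection can look like, the helper below retries a call that may raise or time out. It is a generic sketch, not the mechanism used in the SDK examples: `call_with_retry` is a hypothetical name, and the wrapped `call` is assumed to return None when no response arrives in time.

```python
import time


def call_with_retry(call, attempts=3, delay_s=0.5):
    """Invoke call() up to `attempts` times and return the first non-None result.

    A raised exception or a None result (used here to signal a timeout)
    counts as a failed attempt.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            result = call()
            if result is not None:
                return result
            last_error = TimeoutError(f"attempt {attempt}: no response")
        except Exception as exc:  # keep the caller alive on transport errors
            last_error = exc
        if attempt < attempts:
            time.sleep(delay_s)
    raise RuntimeError(f"service call failed after {attempts} attempts") from last_error
```

Retransmission is safest for queries such as GetVolume or GetMute. Resending a playback request may play the same content twice if only the response was lost, so a caller may prefer to retry those less aggressively.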

Note

Best Practices

  • Choose appropriate priority levels to avoid interfering with important announcements.

  • Implement monitoring and exception handling for speech playback.

  • Implement a playback queue for speech management.

  • Pay attention to the required audio format and sample rate.

  • Make the receive queue (QoS depth) of the VAD subscriber large enough to hold the burst of cached audio sent at the start of a VAD event.
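One possible shape for the client-side playback queue mentioned above is a heap ordered by priority. This is a sketch under assumptions: `PlaybackQueue` is a hypothetical class, levels are passed as the integer TtsPriorityLevel values, and treating a higher `priority_weight` as "play first" within a level is a guess rather than documented behavior.

```python
import heapq
import itertools


class PlaybackQueue:
    """Client-side queue: highest level first, then highest weight, then FIFO."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker that preserves arrival order

    def push(self, text, level, weight=0):
        if not 0 <= weight <= 99:
            raise ValueError("priority_weight must be in 0-99")
        # heapq is a min-heap, so negate level and weight.
        heapq.heappush(self._heap, (-level, -weight, next(self._order), text))

    def pop(self):
        """Return the next text to play, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[3]

    def __len__(self):
        return len(self._heap)


q = PlaybackQueue()
q.push("task finished", 4, 10)
q.push("obstacle ahead", 8)
print(q.pop())  # obstacle ahead
```

Popping one item at a time and waiting for the previous playback to finish before sending the next request keeps the application from starting several playbacks at once.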