SJinn
Image Lipsync (Photo Talk)

Generate talking-head videos from a portrait image and audio using AI lip-sync technology

Tool Overview

The Image Lipsync (Photo Talk) tool generates talking-head videos by combining a portrait image with audio input. It uses AI-driven lip-sync technology to animate the character's mouth, facial expressions, and subtle movements to match the provided audio, creating a realistic speaking video.

Tool Identifier

image-lipsync-api

Parameters

Required Parameters

image

  • Type: string (required)
  • Description: Portrait image URL for the character to be animated
  • Format: Must be a complete HTTP/HTTPS URL (e.g., https://cdn.sjinn.ai/uploads/portrait.jpg)
  • Validation: Must be a non-empty, full URL starting with http:// or https://

audio

  • Type: string (required)
  • Description: Audio URL containing the speech to drive the lip-sync animation
  • Format: Must be a complete HTTP/HTTPS URL (e.g., https://cdn.sjinn.ai/uploads/speech.mp3)
  • Validation: Must be a non-empty, full URL starting with http:// or https://
  • Limits: Audio duration must not exceed 600 seconds (10 minutes)

prompt

  • Type: string (required)
  • Description: Text description to guide the animation style and character expression
  • Validation: Must be a non-empty string
  • Example: "A young woman speaking naturally with gentle expressions"
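The validation rules above can be checked on the client before any credits are spent. The following is a minimal sketch, not part of the SJinn API: the helper name validate_input and its optional audio_seconds argument are hypothetical, and the server still applies its own checks.

```python
from urllib.parse import urlparse

MAX_AUDIO_SECONDS = 600  # limit stated for the audio parameter


def validate_input(image, audio, prompt, audio_seconds=None):
    """Return a list of problems found in the inputs (empty if none).

    Hypothetical client-side helper; it only mirrors the documented rules.
    """
    problems = []
    # image and audio must both be full http:// or https:// URLs.
    for name, value in (("image", image), ("audio", audio)):
        parsed = urlparse(value) if isinstance(value, str) else None
        if parsed is None or parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"{name} must be a full URL (http:// or https://)")
    # prompt must be a non-empty string.
    if not isinstance(prompt, str) or not prompt:
        problems.append("prompt must be a non-empty string")
    # The duration is only checked when the caller already knows it.
    if audio_seconds is not None and audio_seconds > MAX_AUDIO_SECONDS:
        problems.append(f"audio duration {audio_seconds}s exceeds {MAX_AUDIO_SECONDS}s")
    return problems
```

An empty list means the inputs satisfy the documented rules; otherwise each entry names one violated rule.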

Pricing

  • Credits Consumed: 30 * audio_duration_in_seconds credits per task
  • Example: A 10-second audio costs 30 * 10 = 300 credits
  • Membership Requirement: None
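The pricing formula is simple enough to compute before submitting a task. This sketch assumes the duration is already known in seconds; the function name estimate_credits is hypothetical, and how the service rounds fractional seconds is not stated here, so the result should be treated as an estimate.

```python
CREDITS_PER_SECOND = 30   # rate from the pricing formula
MAX_AUDIO_SECONDS = 600   # audio limit from the parameter list


def estimate_credits(audio_seconds):
    """Estimate the credits one task consumes for a clip of the given length."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    if audio_seconds > MAX_AUDIO_SECONDS:
        raise ValueError("audio duration exceeds the 600-second limit")
    return CREDITS_PER_SECOND * audio_seconds


print(estimate_credits(10))   # 300, matching the example above
print(estimate_credits(600))  # 18000, the cost at the maximum length
```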

Request Examples

Basic Usage

curl -X POST https://sjinn.ai/api/un-api/create_tool_task \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "tool_type": "image-lipsync-api",
    "input": {
      "image": "https://cdn.sjinn.ai/uploads/portrait.jpg",
      "audio": "https://cdn.sjinn.ai/uploads/speech.mp3",
      "prompt": "A young woman speaking naturally with gentle expressions"
    }
  }'

Using JavaScript

const response = await fetch('https://sjinn.ai/api/un-api/create_tool_task', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    tool_type: 'image-lipsync-api',
    input: {
      image: 'https://cdn.sjinn.ai/uploads/portrait.jpg',
      audio: 'https://cdn.sjinn.ai/uploads/speech.mp3',
      prompt: 'A young woman speaking naturally with gentle expressions',
    },
  }),
});

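const result = await response.json();
// Error responses carry no "data" field, so check "success" first.
if (!result.success) {
  throw new Error(`Task creation failed (${result.error_code}): ${result.errorMsg}`);
}
console.log('Task ID:', result.data.task_id);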

Using Python

import requests

url = 'https://sjinn.ai/api/un-api/create_tool_task'
headers = {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
}
data = {
    'tool_type': 'image-lipsync-api',
    'input': {
        'image': 'https://cdn.sjinn.ai/uploads/portrait.jpg',
        'audio': 'https://cdn.sjinn.ai/uploads/speech.mp3',
        'prompt': 'A young woman speaking naturally with gentle expressions'
    }
}

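# A timeout keeps the call from hanging indefinitely on network problems.
response = requests.post(url, json=data, headers=headers, timeout=30)
result = response.json()
# Error responses carry no 'data' field, so check 'success' first.
if not result.get('success'):
    raise RuntimeError(f"Task creation failed ({result.get('error_code')}): {result.get('errorMsg')}")
print('Task ID:', result['data']['task_id'])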

Response Examples

Success Response

{
  "success": true,
  "errorMsg": "",
  "error_code": 0,
  "data": {
    "task_id": "550e8400-e29b-41d4-a716-446655440000"
  }
}

Error Response

{
  "success": false,
  "errorMsg": "image must be a full URL (http:// or https://)",
  "error_code": 400
}

{
  "success": false,
  "errorMsg": "Audio duration exceeds maximum limit of 600 seconds, your audio duration is 720",
  "error_code": 101
}
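Because error responses omit the data field, reading task_id directly fails on them. The sketch below shows one way to handle both response shapes; the TaskError class and extract_task_id function are hypothetical names, and only the fields shown in the examples above are assumed to exist.

```python
import json


class TaskError(Exception):
    """Hypothetical exception type for a failed task-creation response."""

    def __init__(self, error_code, message):
        super().__init__(f"[{error_code}] {message}")
        self.error_code = error_code
        self.message = message


def extract_task_id(body):
    """Return the task_id from a parsed response, or raise TaskError."""
    if not body.get("success"):
        raise TaskError(body.get("error_code"), body.get("errorMsg", ""))
    return body["data"]["task_id"]


# The success response shown above, parsed from JSON.
ok = json.loads('{"success": true, "errorMsg": "", "error_code": 0, '
                '"data": {"task_id": "550e8400-e29b-41d4-a716-446655440000"}}')
print(extract_task_id(ok))
```

Callers can then branch on the error_code attribute, for example to tell a malformed URL (400) apart from an over-length audio file (101).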

Best Practices

  1. Image Quality: Use a clear, front-facing portrait image for the best lip-sync results. Avoid images in which anything covers part of the face.
  2. Audio Quality: Use clean audio with minimal background noise. Clear speech produces more accurate lip-sync animations.
  3. Audio Duration: Keep audio under 600 seconds. Shorter audio clips (under 60 seconds) tend to produce higher-quality results.
  4. Prompt Tips: Describe the character's expression and speaking style in the prompt to guide the animation (e.g., "speaking confidently", "with a warm smile").
  5. Generation Time: Generation time depends on audio duration. Short clips (under 30 seconds) typically complete in 1-3 minutes; longer clips may take 5-10 minutes.
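The generation-time guidance in item 5 can serve as a rough wait budget in client code. This is only a sketch of the figures quoted above: the function name expected_wait_minutes is hypothetical, and the ranges are typical values, not guarantees.

```python
def expected_wait_minutes(audio_seconds):
    """Return the (low, high) generation-time range in minutes from item 5."""
    if audio_seconds < 30:
        return (1, 3)   # short clips typically complete in 1-3 minutes
    return (5, 10)      # longer clips may take 5-10 minutes


low, high = expected_wait_minutes(20)
print(f"Expect roughly {low}-{high} minutes")
```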
