Overview

Translators convert text from one language to another. BallonTranslator supports both online API-based and offline model-based translators through a unified interface.

BaseTranslator Class

Base class for all translator modules.

Import

from modules.translators.base import BaseTranslator, TRANSLATORS, register_translator

Class Definition

class BaseTranslator(BaseModule):
    """
    Base class for translation modules.
    
    Handles:
    - Language mapping and validation
    - Text/TextBlock translation
    - Pre/post-processing hooks
    - Text concatenation for batch translation
    """
    
    concate_text = True          # Concatenate text list for batch translation
    cht_require_convert = False  # Auto-enable Traditional Chinese via conversion
    
    _preprocess_hooks = OrderedDict()
    _postprocess_hooks = OrderedDict()

Constructor

__init__
method
Initialize translator with source and target languages.
Parameters:
  • lang_source (str): Source language (e.g., 'Auto', 'English', '日本語')
  • lang_target (str): Target language
  • raise_unsupported_lang (bool): Raise error for unsupported languages (default: True)
  • **params: Additional module parameters
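# Instantiate a concrete subclass; BaseTranslator itself leaves
# _setup_translator and _translate to its subclasses.
translator = TransGoogle(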
    lang_source='日本語',
    lang_target='English'
)

Required Methods

Subclasses must implement these methods:

_setup_translator
method
Initialize the translator. Configure language mappings and setup API clients or models.
def _setup_translator(self):
    # Set up language mappings
    self.lang_map['English'] = 'en'
    self.lang_map['日本語'] = 'ja'
    self.lang_map['简体中文'] = 'zh-CN'
    
    # Initialize translator (API client, model, etc.)
    self.client = TranslationAPI(api_key=self.get_param_value('api_key'))

_translate
method
Translate a list of strings.
Parameters:
  • src_list (List[str]): List of source texts
Returns: List[str] - List of translations (same length as input)
def _translate(self, src_list: List[str]) -> List[str]:
    # Translate all texts
    source_lang = self.lang_map[self.lang_source]
    target_lang = self.lang_map[self.lang_target]
    
    translations = []
    for text in src_list:
        result = self.client.translate(text, source_lang, target_lang)
        translations.append(result)
    
    return translations

Core Methods

translate
method
Translate text or list of texts.
Parameters:
  • text (Union[str, List[str]]): Text(s) to translate
Returns: Union[str, List[str]] - Translation(s)
# Single text
translation = translator.translate("Hello world")

# Multiple texts
translations = translator.translate(["Hello", "World"])
The translate method automatically:
  • Handles empty text
  • Concatenates text list if concate_text=True
  • Validates output length matches input length
  • Applies pre/post-processing hooks
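
The steps listed above can be sketched in isolation. This is a simplified illustration, not the project's implementation: `translate_batch` and the `engine` callable are hypothetical names, hooks are left out, and the separator is the documented default of `textblk_break`.

```python
from typing import Callable, List

SEP = "\n##\n"  # documented default of textblk_break


def translate_batch(texts: List[str],
                    engine: Callable[[str], str],
                    concate_text: bool = True) -> List[str]:
    """Translate a list of texts, optionally as one concatenated request."""
    if not any(t.strip() for t in texts):
        return list(texts)  # nothing to translate
    if concate_text:
        # One request for the whole list, then split the result again
        result = engine(SEP.join(texts)).split(SEP)
    else:
        result = [engine(t) for t in texts]
    if len(result) != len(texts):
        raise ValueError(f"expected {len(texts)} translations, got {len(result)}")
    return result


# Toy engine (assumption): upper-cases its input instead of translating
print(translate_batch(["hello", "world"], str.upper))  # ['HELLO', 'WORLD']
```

The length check matters most in the concatenated case, because an engine that alters the separator silently changes the number of returned items.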

translate_textblk_lst
method
Translate a list of TextBlocks.
Parameters:
  • textblk_lst (List[TextBlock]): Text blocks to translate
Side Effects:
  • Sets blk.translation attribute on each TextBlock
# After OCR
translator.translate_textblk_lst(text_blocks)

for blk in text_blocks:
    print(f"Original: {blk.get_text()}")
    print(f"Translation: {blk.translation}")

Language Management

set_source
method
Set source language.
Parameters:
  • lang (str): Language name (must be in supported_src_list)
Raises: InvalidSourceOrTargetLanguage if unsupported
translator.set_source('日本語')

set_target
method
Set target language.
Parameters:
  • lang (str): Language name (must be in supported_tgt_list)
Raises: InvalidSourceOrTargetLanguage if unsupported
translator.set_target('English')

Properties

name
str
Translator name from registry. Automatically set during initialization.
lang_source
str
Current source language.
lang_target
str
Current target language.
lang_map
Dict[str, str]
Mapping from display names to translator-specific language codes.
# Example
{'English': 'en', '日本語': 'ja', 'Auto': ''}
supported_languages
method
Get list of all supported languages.
Returns: List[str]
supported_src_list
property
List of supported source languages. Override to customize.
Returns: List[str]
supported_tgt_list
property
List of supported target languages. Override to customize.
Returns: List[str]
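
To make the role of `lang_map` concrete, here is a small standalone sketch of how a display name might be resolved to an engine code. The helpers `usable_languages` and `to_code` are hypothetical, and the rule that an empty code marks a language as unavailable is an assumption based on the `'Auto': ''` entry in the example above.

```python
from typing import Dict, List

lang_map: Dict[str, str] = {"English": "en", "日本語": "ja", "Auto": ""}


def usable_languages(mapping: Dict[str, str]) -> List[str]:
    """Display names that have a non-empty engine code (assumed rule)."""
    return [name for name, code in mapping.items() if code]


def to_code(mapping: Dict[str, str], name: str) -> str:
    """Resolve a display name to its engine code, rejecting unmapped names."""
    code = mapping.get(name, "")
    if not code:
        raise ValueError(f"unsupported language: {name!r}")
    return code


print(usable_languages(lang_map))  # ['English', '日本語']
print(to_code(lang_map, "日本語"))  # ja
```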

Utility Methods

textlist2text
method
Concatenate text list into a single string for batch translation.
Parameters:
  • text_list (List[str]): List of texts
Returns: str - Concatenated text with separator
# Uses self.textblk_break as separator (default: '\n##\n')
combined = translator.textlist2text(["Hello", "World"])
# "Hello\n##\nWorld"

text2textlist
method
Split concatenated text back into list.
Parameters:
  • text (str): Concatenated translation
Returns: List[str] - List of individual translations
translations = translator.text2textlist("Hello\n##\nWorld")
# ["Hello", "World"]

delay
method
Get delay between requests (for rate limiting).
Returns: float - Delay in seconds
import time
time.sleep(translator.delay())
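
When the delay is applied inside a loop, it is usually enough to pause between calls rather than after each one. The sketch below shows one way to do that; `throttled_map` is a hypothetical helper, and the injectable `sleep` argument exists only to make the behavior easy to test.

```python
import time
from typing import Callable, List


def throttled_map(func: Callable[[str], str],
                  items: List[str],
                  delay: float,
                  sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Apply func to each item, pausing `delay` seconds between calls."""
    results = []
    for index, item in enumerate(items):
        if index > 0 and delay > 0:
            sleep(delay)  # pause before every call except the first
        results.append(func(item))
    return results


# Toy function (assumption): upper-casing stands in for an API request
print(throttled_map(str.upper, ["a", "b", "c"], delay=0.01))  # ['A', 'B', 'C']
```

In a translator, `delay` would come from `translator.delay()` and `func` would wrap the request for a single text.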

Language Map

Global language mapping with standard display names:
from modules.translators.base import LANGMAP_GLOBAL

print(LANGMAP_GLOBAL.keys())
# Auto, 简体中文, 繁體中文, 日本語, English, 한국어, 
# Tiếng Việt, čeština, Nederlands, Français, Deutsch, 
# magyar nyelv, Italiano, Polski, Português, Brazilian Portuguese,
# limba română, русский язык, Español, Türk dili, 
# українська мова, Thai, Arabic, Hindi, Malayalam, Tamil

Built-in Translators

Google Translate

from modules.translators.trans_google import TransGoogle

translator = TransGoogle(
    lang_source='日本語',
    lang_target='English'
)

translation = translator.translate("こんにちは")
print(translation)  # e.g. "Hello" (actual output depends on the service)

Features

  • Free, no API key required
  • Supports most languages
  • concate_text = False (translates individually)

M2M100 (Offline)

Local neural translation model.
from modules.translators.trans_m2m100 import M2M100Translator

translator = M2M100Translator(
    lang_source='日本語',  # display name, as listed in LANGMAP_GLOBAL
    lang_target='English',
    device='cuda'
)

Features

  • Fully offline
  • 100+ language support
  • Requires model download
  • GPU acceleration

Example Implementations

Simple API Translator

import requests
from typing import List
from modules.translators.base import (
    BaseTranslator,
    MissingTranslatorParams,
    register_translator,
)

@register_translator('my_translator')
class MyTranslator(BaseTranslator):
    
    concate_text = False
    params = {
        'api_key': '',
        'delay': 0.5,
    }
    
    def _setup_translator(self):
        """Setup language mappings."""
        self.lang_map['Auto'] = 'auto'
        self.lang_map['English'] = 'en'
        self.lang_map['日本語'] = 'ja'
        self.lang_map['简体中文'] = 'zh'
        
        # Get API key from params
        self.api_key = self.get_param_value('api_key')
        if not self.api_key:
            raise MissingTranslatorParams('api_key is required')
    
    def _translate(self, src_list: List[str]) -> List[str]:
        """Translate text list."""
        source_lang = self.lang_map[self.lang_source]
        target_lang = self.lang_map[self.lang_target]
        
        translations = []
        for text in src_list:
            response = requests.post(
                'https://api.mytranslator.com/translate',
                json={
                    'text': text,
                    'source': source_lang,
                    'target': target_lang
                },
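                headers={'Authorization': f'Bearer {self.api_key}'},
                timeout=30,
            )
            # Fail loudly on HTTP errors instead of parsing an error body
            response.raise_for_status()
            result = response.json()['translation']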
            translations.append(result)
        
        return translations

Offline Model Translator

from typing import List

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from modules.translators.base import BaseTranslator, register_translator, DEVICE_SELECTOR

@register_translator('offline_translator')
class OfflineTranslator(BaseTranslator):
    
    concate_text = True
    params = {
        'device': DEVICE_SELECTOR()
    }
    
    _load_model_keys = {'model', 'tokenizer'}
    
    def __init__(self, **params):
        super().__init__(**params)
        self.model = None
        self.tokenizer = None
    
    def _setup_translator(self):
        """Setup language mappings."""
        self.lang_map['English'] = 'en'
        self.lang_map['日本語'] = 'ja'
        self.lang_map['简体中文'] = 'zh'
    
    def _load_model(self):
        """Load translation model."""
        device = self.get_param_value('device')
        model_path = 'data/models/my-translation-model'
        
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
        self.model.to(device)
        self.model.eval()
    
    def _translate(self, src_list: List[str]) -> List[str]:
        """Translate using local model."""
        if not self.all_model_loaded():
            self.load_model()
        
        device = self.get_param_value('device')
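        source_lang = self.lang_map[self.lang_source]
        target_lang = self.lang_map[self.lang_target]
        
        # Many multilingual tokenizers read the source language from src_lang
        self.tokenizer.src_lang = source_lang
        
        # Prepare input
        inputs = self.tokenizer(
            src_list,
            return_tensors='pt',
            padding=True,
            truncation=True
        ).to(device)
        
        # Translate
        with torch.no_grad():
            # lang_code_to_id is not available on every tokenizer;
            # adapt this lookup to the model in use
            outputs = self.model.generate(
                **inputs,
                forced_bos_token_id=self.tokenizer.lang_code_to_id[target_lang]
            )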
        
        # Decode
        translations = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
        
        return translations
    
    @property
    def supported_src_list(self):
        return ['English', '日本語', '简体中文']
    
    @property
    def supported_tgt_list(self):
        return ['English', '日本語', '简体中文']

Advanced: With Custom Language Support

@register_translator('advanced_translator')
class AdvancedTranslator(BaseTranslator):
    
    def _setup_translator(self):
        # Standard languages
        self.lang_map['English'] = 'en'
        self.lang_map['日本語'] = 'ja'
        
        # Auto-enable Traditional Chinese via conversion from Simplified
        self.cht_require_convert = True
        self.lang_map['简体中文'] = 'zh-CN'
        # Now '繁體中文' is automatically supported
    
    @property
    def supported_src_list(self):
        """Custom source language list."""
        # All languages except Auto
        return [lang for lang in self.valid_lang_list if lang != 'Auto']
    
    @property
    def supported_tgt_list(self):
        """Custom target language list."""
        # All languages except Auto
        return [lang for lang in self.valid_lang_list if lang != 'Auto']

Hooks

Preprocessing Hooks

def preprocess_hook(translations: List[str], textblocks: List[TextBlock], 
                    translator: BaseTranslator, source_text: List[str]):
    """Modify source text before translation."""
    # Example: Remove special characters
    for i, text in enumerate(source_text):
        source_text[i] = text.replace('*', '').replace('~', '')

# Register globally
BaseTranslator.register_preprocess_hooks(preprocess_hook)

# Or for specific translator
MyTranslator.register_preprocess_hooks(preprocess_hook)

Postprocessing Hooks

def postprocess_hook(translations: List[str], textblocks: List[TextBlock], 
                     translator: BaseTranslator):
    """Modify translations after translation."""
    # Example: Capitalize the first character
    # (str.capitalize() would also lowercase the rest of the text)
    for i, text in enumerate(translations):
        translations[i] = text[:1].upper() + text[1:]

BaseTranslator.register_postprocess_hooks(postprocess_hook)
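
The hook mechanism can be pictured as an ordered mapping of callables that each mutate a list in place. The sketch below is a standalone illustration under that assumption; `HookRunner` is a hypothetical class and its hooks take a single argument, unlike the real hook signatures shown above.

```python
from collections import OrderedDict
from typing import Callable, List


class HookRunner:
    """Keeps named hooks in registration order and applies them in place."""

    def __init__(self) -> None:
        self._hooks = OrderedDict()

    def register(self, hook: Callable[[List[str]], None]) -> None:
        # Keyed by function name, so re-registering a name replaces the old hook
        self._hooks[hook.__name__] = hook

    def run(self, texts: List[str]) -> None:
        for hook in self._hooks.values():
            hook(texts)


def strip_marks(texts: List[str]) -> None:
    for i, text in enumerate(texts):
        texts[i] = text.replace("*", "").replace("~", "")


def trim(texts: List[str]) -> None:
    for i, text in enumerate(texts):
        texts[i] = text.strip()


runner = HookRunner()
runner.register(strip_marks)
runner.register(trim)

lines = [" *hello* ", "~world~ "]
runner.run(lines)
print(lines)  # ['hello', 'world']
```

Because hooks modify the list they receive rather than returning a new one, assigning by index (`texts[i] = ...`) is what makes the change visible to the caller.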

Error Handling

Custom Exceptions

from modules.translators.base import (
    InvalidSourceOrTargetLanguage,
    TranslatorSetupFailure,
    MissingTranslatorParams
)

try:
    translator = MyTranslator(
        lang_source='InvalidLang',
        lang_target='English'
    )
except InvalidSourceOrTargetLanguage as e:
    print(f"Unsupported language: {e}")
    print(f"Supported: {e.message}")  # List of valid languages

try:
    translator = MyTranslator(
        lang_source='English',
        lang_target='日本語'
    )
except MissingTranslatorParams as e:
    print(f"Missing parameter: {e}")

try:
    translator = MyTranslator(
        lang_source='English',
        lang_target='日本語',
        api_key='invalid'
    )
except TranslatorSetupFailure as e:
    print(f"Setup failed: {e}")

Best Practices

1. Text Concatenation

# For translators that support batch translation
class BatchTranslator(BaseTranslator):
    concate_text = True  # Concatenate for batch API call
    
    def _translate(self, src_list: List[str]) -> List[str]:
        # src_list is already concatenated into single string
        # by textlist2text()
        pass

# For translators that handle individual texts
class SingleTranslator(BaseTranslator):
    concate_text = False  # Translate each individually
    
    def _translate(self, src_list: List[str]) -> List[str]:
        # src_list is a list of individual strings;
        # translate_single is a placeholder for the per-text call
        return [self.translate_single(text) for text in src_list]

2. Rate Limiting

import time

class RateLimitedTranslator(BaseTranslator):
    params = {
        'delay': 0.5  # Seconds between requests
    }
    
    def _translate(self, src_list: List[str]) -> List[str]:
        translations = []
        for text in src_list:
            translation = self.api_call(text)
            translations.append(translation)
            
            # Respect rate limit
            time.sleep(self.delay())
        
        return translations

3. Error Recovery

class RobustTranslator(BaseTranslator):
    
    def _translate(self, src_list: List[str]) -> List[str]:
        translations = []
        
        for text in src_list:
            try:
                result = self.api_call(text)
                translations.append(result)
            except Exception as e:
                self.logger.error(f"Translation failed: {e}")
                # Return original text or empty string
                translations.append(text)  # or ""
        
        return translations
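
Transient network failures are often worth retrying before falling back to the source text. The following sketch adds retries with exponential backoff, a technique not shown in the class above; `call_with_retry` and `flaky` are hypothetical names, and the retry counts and delays are arbitrary.

```python
import time
from typing import Callable


def call_with_retry(func: Callable[[str], str],
                    text: str,
                    attempts: int = 3,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Call func(text), retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return func(text)
        except Exception:
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # base, 2x base, 4x base, ...
    return text  # every attempt failed: keep the original text


calls = {"count": 0}


def flaky(text: str) -> str:
    """Hypothetical API call that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return text.upper()


print(call_with_retry(flaky, "hello", base_delay=0.0))  # HELLO
```

Returning the original text on total failure keeps the output list the same length as the input, which the length validation in `translate` depends on.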

4. Caching

from functools import lru_cache

class CachedTranslator(BaseTranslator):
    
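    @lru_cache(maxsize=1000)
    def translate_cached(self, text: str) -> str:
        """Cache translation results."""
        # Call the backend directly (api_call is a placeholder); calling
        # self._translate here would recurse back into this method forever
        return self.api_call(text)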
    
    def _translate(self, src_list: List[str]) -> List[str]:
        # Use cache for repeated translations
        return [self.translate_cached(text) for text in src_list]
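
An alternative to decorating a method with `functools.lru_cache` is a plain per-instance dictionary, which avoids keeping the instance alive through the cache and makes the cache key explicit. The sketch below is a standalone illustration; `TranslationCache` and its `engine` callable are hypothetical. Including the language pair in the key is a design choice so that results stay valid if the languages are changed later.

```python
from typing import Callable, Dict, List, Tuple


class TranslationCache:
    """Per-instance cache keyed by (source text, source lang, target lang)."""

    def __init__(self, engine: Callable[[str], str], src: str, tgt: str) -> None:
        self.engine = engine
        self.src = src
        self.tgt = tgt
        self._cache: Dict[Tuple[str, str, str], str] = {}
        self.misses = 0

    def translate(self, texts: List[str]) -> List[str]:
        results = []
        for text in texts:
            key = (text, self.src, self.tgt)
            if key not in self._cache:
                self.misses += 1  # only uncached texts reach the engine
                self._cache[key] = self.engine(text)
            results.append(self._cache[key])
        return results


# Toy engine (assumption): upper-casing stands in for a real translation
cache = TranslationCache(str.upper, "en", "ja")
print(cache.translate(["hi", "hi", "bye"]))  # ['HI', 'HI', 'BYE']
print(cache.misses)  # 2
```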

Registry Usage

Listing Translators

from modules.base import init_translator_registries
from modules.translators.base import TRANSLATORS

init_translator_registries()

print("Available translators:")
for name in TRANSLATORS.module_dict:
    print(f"  - {name}")

Dynamic Selection

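from modules.translators.base import TRANSLATORS

def get_translator(name: str, source: str, target: str, **params):
    """Get translator by name."""
    # module_dict maps registered names to translator classes
    if name not in TRANSLATORS.module_dict:
        raise ValueError(f"Unknown translator: {name}")
    
    translator_class = TRANSLATORS.module_dict[name]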
    return translator_class(
        lang_source=source,
        lang_target=target,
        **params
    )

# Usage
translator = get_translator('google', '日本語', 'English')
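
For readers unfamiliar with the pattern, a decorator-based registry can be sketched in a few lines. This is a simplified stand-in, not the project's registry: `Registry`, `DEMO_TRANSLATORS`, and `EchoTranslator` are hypothetical, and only the `module_dict` attribute name is borrowed from the listing example above.

```python
from typing import Callable, Dict


class Registry:
    """Maps string names to classes via a decorator."""

    def __init__(self) -> None:
        self.module_dict: Dict[str, type] = {}

    def register(self, name: str) -> Callable[[type], type]:
        def decorator(cls: type) -> type:
            if name in self.module_dict:
                raise KeyError(f"{name!r} is already registered")
            self.module_dict[name] = cls
            return cls  # return the class unchanged so it stays usable directly
        return decorator


DEMO_TRANSLATORS = Registry()


@DEMO_TRANSLATORS.register("echo")
class EchoTranslator:
    def translate(self, text: str) -> str:
        return text


translator = DEMO_TRANSLATORS.module_dict["echo"]()
print(translator.translate("hello"))  # hello
print(list(DEMO_TRANSLATORS.module_dict))  # ['echo']
```

Registration happens when the decorated class is defined, which is why the modules containing translators must be imported before the registry is queried.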