Overview
Translators convert text from one language to another. BallonTranslator supports both online API-based and offline model-based translators through a unified interface.
BaseTranslator Class
Base class for all translator modules.
Import
from modules.translators.base import BaseTranslator, TRANSLATORS, register_translator
Class Definition
class BaseTranslator(BaseModule):
    """
    Base class for translation modules.
    Handles:
    - Language mapping and validation
    - Text/TextBlock translation
    - Pre/post-processing hooks
    - Text concatenation for batch translation
    """
    concate_text = True  # Concatenate text list for batch translation
    cht_require_convert = False  # Auto-enable Traditional Chinese via conversion
    _preprocess_hooks = OrderedDict()
    _postprocess_hooks = OrderedDict()
Constructor
Initialize the translator with source and target languages.
Parameters:
lang_source (str): Source language (e.g., 'Auto', 'English', '日本語')
lang_target (str): Target language
raise_unsupported_lang (bool): Raise an error for unsupported languages (default: True)
**params: Additional module parameters
# Shown with the base class for illustration; in practice, instantiate a
# subclass that implements the required methods described below.
translator = BaseTranslator(
    lang_source='日本語',
    lang_target='English'
)
Required Methods
Subclasses must implement these methods:
_setup_translator
Initialize the translator. Configure language mappings and set up API clients or models.
def _setup_translator(self):
    # Set up language mappings
    self.lang_map['English'] = 'en'
    self.lang_map['日本語'] = 'ja'
    self.lang_map['简体中文'] = 'zh-CN'
    # Initialize translator (API client, model, etc.);
    # TranslationAPI is a placeholder for your own client class
    self.client = TranslationAPI(api_key=self.get_param_value('api_key'))
_translate
Translate a list of strings.
Parameters:
src_list (List[str]): List of source texts
Returns: List[str] - List of translations (same length as input)
def _translate(self, src_list: List[str]) -> List[str]:
    # Translate each text in turn
    source_lang = self.lang_map[self.lang_source]
    target_lang = self.lang_map[self.lang_target]
    translations = []
    for text in src_list:
        result = self.client.translate(text, source_lang, target_lang)
        translations.append(result)
    return translations
Core Methods
translate
Translate a single text or a list of texts.
Parameters:
text (Union[str, List[str]]): Text(s) to translate
Returns: Union[str, List[str]] - Translation(s)
# Single text
translation = translator.translate("Hello world")
# Multiple texts
translations = translator.translate(["Hello", "World"])
The translate method automatically:
- Handles empty text
- Concatenates the text list if concate_text=True
- Validates that the output length matches the input length
- Applies pre/post-processing hooks
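The concatenate-then-validate behaviour listed above can be sketched in isolation. The snippet below is a minimal illustration, not the library's code: translate_batch, upper_backend, and the choice of ValueError for a length mismatch are all assumptions made for the example.

```python
from typing import Callable, List

SEPARATOR = "\n##\n"  # default separator quoted in this document


def translate_batch(
    texts: List[str],
    backend: Callable[[List[str]], List[str]],
    concate_text: bool = True,
) -> List[str]:
    """Hypothetical stand-in for the concatenate/validate steps of translate()."""
    if not texts:
        return []
    if concate_text:
        # One backend call on the joined string, then split the result back.
        joined = backend([SEPARATOR.join(texts)])[0]
        results = joined.split(SEPARATOR)
    else:
        results = backend(texts)
    if len(results) != len(texts):
        # Assumption: a mismatch is reported as ValueError in this sketch.
        raise ValueError(f"expected {len(texts)} translations, got {len(results)}")
    return results


def upper_backend(batch: List[str]) -> List[str]:
    # Toy "translator" that upper-cases its input.
    return [text.upper() for text in batch]


print(translate_batch(["hello", "world"], upper_backend))  # ['HELLO', 'WORLD']
```

The length check matters most in the concatenated path, where a backend that drops or rewrites the separator silently changes the number of segments.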
translate_textblk_lst
Translate a list of TextBlocks.
Parameters:
textblk_lst (List[TextBlock]): Text blocks to translate
Side Effects:
- Sets the blk.translation attribute on each TextBlock
# After OCR
translator.translate_textblk_lst(text_blocks)
for blk in text_blocks:
    print(f"Original: {blk.get_text()}")
    print(f"Translation: {blk.translation}")
Language Management
set_source
Set the source language.
Parameters:
lang (str): Language name (must be in supported_src_list)
Raises: InvalidSourceOrTargetLanguage if the language is unsupported
translator.set_source('日本語')
set_target
Set the target language.
Parameters:
lang (str): Language name (must be in supported_tgt_list)
Raises: InvalidSourceOrTargetLanguage if the language is unsupported
translator.set_target('English')
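The validate-before-assign pattern behind set_source and set_target can be illustrated with a small stand-alone class. Everything here is hypothetical: LanguagePair and UnsupportedLanguage are stand-ins invented for the example, and treating an empty code as "unsupported" is an assumption rather than documented behaviour.

```python
from typing import Dict, List


class UnsupportedLanguage(Exception):
    """Hypothetical stand-in for InvalidSourceOrTargetLanguage."""


class LanguagePair:
    def __init__(self, lang_map: Dict[str, str], source: str, target: str):
        self.lang_map = lang_map
        self.set_source(source)
        self.set_target(target)

    @property
    def supported(self) -> List[str]:
        # Assumption: a language counts as supported when its code is non-empty.
        return [name for name, code in self.lang_map.items() if code]

    def _check(self, lang: str) -> str:
        if lang not in self.supported:
            raise UnsupportedLanguage(f"{lang!r} not in {self.supported}")
        return lang

    def set_source(self, lang: str) -> None:
        # Validation runs first, so a failed call leaves the old value intact.
        self.lang_source = self._check(lang)

    def set_target(self, lang: str) -> None:
        self.lang_target = self._check(lang)


pair = LanguagePair({"English": "en", "日本語": "ja", "Auto": ""}, "日本語", "English")
pair.set_target("日本語")
print(pair.lang_source, "->", pair.lang_target)
```

Because the check happens before the attribute is written, a rejected language never leaves the object in a half-updated state.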
Properties
name
Translator name from the registry. Automatically set during initialization.
lang_map
Mapping from display names to translator-specific language codes.
# Example
{'English': 'en', '日本語': 'ja', 'Auto': ''}
valid_lang_list
List of all supported languages.
Returns: List[str]
supported_src_list
List of supported source languages. Override to customize.
Returns: List[str]
supported_tgt_list
List of supported target languages. Override to customize.
Returns: List[str]
Utility Methods
textlist2text
Concatenate a text list into a single string for batch translation.
Parameters:
text_list (List[str]): List of texts
Returns: str - Concatenated text with separator
# Uses self.textblk_break as separator (default: '\n##\n')
combined = translator.textlist2text(["Hello", "World"])
# "Hello\n##\nWorld"
text2textlist
Split concatenated text back into a list.
Parameters:
text (str): Concatenated translation
Returns: List[str] - List of individual translations
translations = translator.text2textlist("Hello\n##\nWorld")
# ["Hello", "World"]
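The join/split round trip can be reproduced with plain string operations. The sketch below uses the default separator quoted above; join_texts and split_texts are hypothetical names, and the whitespace-tolerant split is an assumption about what a robust implementation might want, not a description of text2textlist itself.

```python
import re
from typing import List

TEXTBLK_BREAK = "\n##\n"  # default separator quoted in this document


def join_texts(text_list: List[str]) -> str:
    return TEXTBLK_BREAK.join(text_list)


def split_texts(text: str) -> List[str]:
    # Assumption: accept "##" with any surrounding whitespace, since a
    # translation service may normalise the newlines around the marker.
    return [part.strip() for part in re.split(r"\s*##\s*", text)]


joined = join_texts(["Hello", "World"])
print(repr(joined))                     # 'Hello\n##\nWorld'
print(split_texts(joined))              # ['Hello', 'World']
print(split_texts("Bonjour ## Monde"))  # ['Bonjour', 'Monde']
```

A tolerant split reduces length-mismatch errors, at the cost of misfiring on source text that itself contains "##".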
delay
Get the delay between requests (for rate limiting).
Returns: float - Delay in seconds
import time
time.sleep(translator.delay())
Language Map
Global language mapping with standard display names:
from modules.translators.base import LANGMAP_GLOBAL
print(LANGMAP_GLOBAL.keys())
# Auto, 简体中文, 繁體中文, 日本語, English, 한국어,
# Tiếng Việt, čeština, Nederlands, Français, Deutsch,
# magyar nyelv, Italiano, Polski, Português, Brazilian Portuguese,
# limba română, русский язык, Español, Türk dili,
# українська мова, Thai, Arabic, Hindi, Malayalam, Tamil
Built-in Translators
Google Translate
from modules.translators.trans_google import TransGoogle
translator = TransGoogle(
    lang_source='日本語',
    lang_target='English'
)
translation = translator.translate("こんにちは")
print(translation) # "Hello"
Features
- Free, no API key required
- Supports most languages
- concate_text = False (translates each text individually)
M2M100 (Offline)
Local neural translation model.
from modules.translators.trans_m2m100 import M2M100Translator
translator = M2M100Translator(
    lang_source='日本語',  # Use the display name from LANGMAP_GLOBAL, not 'Japanese'
    lang_target='English',
    device='cuda'
)
Features
- Fully offline
- 100+ language support
- Requires model download
- GPU acceleration
Example Implementations
Simple API Translator
import requests
from typing import List
from modules.translators.base import (
    BaseTranslator,
    MissingTranslatorParams,
    register_translator,
)

@register_translator('my_translator')
class MyTranslator(BaseTranslator):
    concate_text = False
    params = {
        'api_key': '',
        'delay': 0.5,
    }

    def _setup_translator(self):
        """Set up language mappings."""
        self.lang_map['Auto'] = 'auto'
        self.lang_map['English'] = 'en'
        self.lang_map['日本語'] = 'ja'
        self.lang_map['简体中文'] = 'zh'
        # Get API key from params
        self.api_key = self.get_param_value('api_key')
        if not self.api_key:
            raise MissingTranslatorParams('api_key is required')

    def _translate(self, src_list: List[str]) -> List[str]:
        """Translate text list."""
        source_lang = self.lang_map[self.lang_source]
        target_lang = self.lang_map[self.lang_target]
        translations = []
        for text in src_list:
            response = requests.post(
                'https://api.mytranslator.com/translate',
                json={
                    'text': text,
                    'source': source_lang,
                    'target': target_lang
                },
                headers={'Authorization': f'Bearer {self.api_key}'},
                timeout=30,  # Avoid hanging forever on a stalled connection
            )
            response.raise_for_status()  # Surface HTTP errors before parsing
            result = response.json()['translation']
            translations.append(result)
        return translations
Offline Model Translator
import torch
from typing import List
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from modules.translators.base import BaseTranslator, register_translator, DEVICE_SELECTOR

@register_translator('offline_translator')
class OfflineTranslator(BaseTranslator):
    concate_text = True
    params = {
        'device': DEVICE_SELECTOR()
    }
    _load_model_keys = {'model', 'tokenizer'}

    def __init__(self, **params):
        super().__init__(**params)
        self.model = None
        self.tokenizer = None

    def _setup_translator(self):
        """Set up language mappings."""
        self.lang_map['English'] = 'en'
        self.lang_map['日本語'] = 'ja'
        self.lang_map['简体中文'] = 'zh'

    def _load_model(self):
        """Load translation model."""
        device = self.get_param_value('device')
        model_path = 'data/models/my-translation-model'
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
        self.model.to(device)
        self.model.eval()

    def _translate(self, src_list: List[str]) -> List[str]:
        """Translate using local model."""
        if not self.all_model_loaded():
            self.load_model()
        device = self.get_param_value('device')
        source_lang = self.lang_map[self.lang_source]
        target_lang = self.lang_map[self.lang_target]
        # Prepare input
        inputs = self.tokenizer(
            src_list,
            return_tensors='pt',
            padding=True,
            truncation=True
        ).to(device)
        # Translate; this assumes a tokenizer that exposes lang_code_to_id,
        # so adjust the lookup for the model you actually load
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                forced_bos_token_id=self.tokenizer.lang_code_to_id[target_lang]
            )
        # Decode
        translations = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
        return translations

    @property
    def supported_src_list(self):
        return ['English', '日本語', '简体中文']

    @property
    def supported_tgt_list(self):
        return ['English', '日本語', '简体中文']
Advanced: With Custom Language Support
@register_translator('advanced_translator')
class AdvancedTranslator(BaseTranslator):
    def _setup_translator(self):
        # Standard languages
        self.lang_map['English'] = 'en'
        self.lang_map['日本語'] = 'ja'
        # Auto-enable Traditional Chinese via conversion from Simplified
        self.cht_require_convert = True
        self.lang_map['简体中文'] = 'zh-CN'
        # Now '繁體中文' is automatically supported

    @property
    def supported_src_list(self):
        """Custom source language list."""
        # All languages except Auto
        return [lang for lang in self.valid_lang_list if lang != 'Auto']

    @property
    def supported_tgt_list(self):
        """Custom target language list."""
        # All languages except Auto
        return [lang for lang in self.valid_lang_list if lang != 'Auto']
Hooks
Preprocessing Hooks
def preprocess_hook(translations: List[str], textblocks: List[TextBlock],
                    translator: BaseTranslator, source_text: List[str]):
    """Modify source text before translation."""
    # Example: Remove special characters
    for i, text in enumerate(source_text):
        source_text[i] = text.replace('*', '').replace('~', '')
# Register globally
BaseTranslator.register_preprocess_hooks(preprocess_hook)
# Or for a specific translator
MyTranslator.register_preprocess_hooks(preprocess_hook)
Postprocessing Hooks
def postprocess_hook(translations: List[str], textblocks: List[TextBlock],
                     translator: BaseTranslator):
    """Modify translations after translation."""
    # Example: Capitalize sentences
    for i, text in enumerate(translations):
        translations[i] = text.capitalize()
BaseTranslator.register_postprocess_hooks(postprocess_hook)
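How registered hooks might be stored and applied in order can be sketched with a toy class. This is a hypothetical miniature, not the real BaseTranslator: HookedTranslator, the name-keyed OrderedDict, and the reversed-string "translation" are all invented for the illustration.

```python
from collections import OrderedDict
from typing import Callable, List


class HookedTranslator:
    """Hypothetical miniature of the hook mechanism; not the real BaseTranslator."""

    _postprocess_hooks = OrderedDict()

    @classmethod
    def register_postprocess_hooks(cls, hook: Callable) -> None:
        # Keyed by function name so re-registering replaces instead of duplicating.
        cls._postprocess_hooks[hook.__name__] = hook

    def translate(self, texts: List[str]) -> List[str]:
        translations = [t[::-1] for t in texts]  # toy "translation": reverse text
        for hook in self._postprocess_hooks.values():
            hook(translations=translations, translator=self)  # hooks mutate in place
        return translations


def strip_spaces(translations: List[str], **kwargs) -> None:
    for i, text in enumerate(translations):
        translations[i] = text.strip()


def capitalize(translations: List[str], **kwargs) -> None:
    for i, text in enumerate(translations):
        translations[i] = text.capitalize()


HookedTranslator.register_postprocess_hooks(strip_spaces)
HookedTranslator.register_postprocess_hooks(capitalize)
print(HookedTranslator().translate([" dlrow olleh "]))  # ['Hello world']
```

Registration order is the execution order here, which is why stripping runs before capitalizing.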
Error Handling
Custom Exceptions
from modules.translators.base import (
    InvalidSourceOrTargetLanguage,
    TranslatorSetupFailure,
    MissingTranslatorParams
)
try:
    translator = MyTranslator(
        lang_source='InvalidLang',
        lang_target='English'
    )
except InvalidSourceOrTargetLanguage as e:
    print(f"Unsupported language: {e}")
    print(f"Supported: {e.message}")  # List of valid languages
try:
    translator = MyTranslator(
        lang_source='English',
        lang_target='日本語'
    )
except MissingTranslatorParams as e:
    print(f"Missing parameter: {e}")
try:
    translator = MyTranslator(
        lang_source='English',
        lang_target='日本語',
        api_key='invalid'
    )
except TranslatorSetupFailure as e:
    print(f"Setup failed: {e}")
Best Practices
1. Text Concatenation
# For translators that support batch translation
class BatchTranslator(BaseTranslator):
    concate_text = True  # Concatenate for batch API call

    def _translate(self, src_list: List[str]) -> List[str]:
        # src_list is already concatenated into a single string
        # by textlist2text()
        pass

# For translators that handle individual texts
class SingleTranslator(BaseTranslator):
    concate_text = False  # Translate each individually

    def _translate(self, src_list: List[str]) -> List[str]:
        # src_list is a list of individual strings
        return [self.translate_single(text) for text in src_list]
2. Rate Limiting
import time

class RateLimitedTranslator(BaseTranslator):
    params = {
        'delay': 0.5  # Seconds between requests
    }

    def _translate(self, src_list: List[str]) -> List[str]:
        translations = []
        for text in src_list:
            translation = self.api_call(text)
            translations.append(translation)
            # Respect rate limit
            time.sleep(self.delay())
        return translations
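A fixed sleep after every request also waits after the last one and ignores time already spent on the request itself. One possible refinement is a small helper that sleeps only for the part of the interval that has not yet elapsed. Throttle is a hypothetical class written for this illustration; it is not part of the translator API, and the injectable clock and sleep parameters exist only to make it testable.

```python
import time
from typing import Callable


class Throttle:
    """Hypothetical helper: enforce a minimum interval between calls."""

    def __init__(self, interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self) -> float:
        """Sleep for the part of the interval not yet elapsed; return that time."""
        now = self._clock()
        remaining = 0.0
        if self._last is not None:
            remaining = max(0.0, self.interval - (now - self._last))
        if remaining > 0:
            self._sleep(remaining)
        self._last = now + remaining
        return remaining


throttle = Throttle(interval=0.05)
for text in ["a", "b", "c"]:
    throttle.wait()  # call before each request; the first call never sleeps
    print("request:", text)
```

In a translator, the interval would presumably come from self.delay(), with throttle.wait() called at the top of the request loop.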
3. Error Recovery
class RobustTranslator(BaseTranslator):
    def _translate(self, src_list: List[str]) -> List[str]:
        translations = []
        for text in src_list:
            try:
                result = self.api_call(text)
                translations.append(result)
            except Exception as e:
                self.logger.error(f"Translation failed: {e}")
                # Return original text or empty string
                translations.append(text)  # or ""
        return translations
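For transient failures, a bounded retry before falling back to the source text may recover more translations. The function below is a sketch under stated assumptions: translate_with_retry, its retries and backoff parameters, and the flaky demo backend are all hypothetical and not part of the translator API.

```python
import time
from typing import Callable, List


def translate_with_retry(
    texts: List[str],
    api_call: Callable[[str], str],
    retries: int = 2,
    backoff: float = 0.0,
) -> List[str]:
    """Try each text up to retries + 1 times, then fall back to the source text."""
    results = []
    for text in texts:
        for attempt in range(retries + 1):
            try:
                results.append(api_call(text))
                break
            except Exception:
                if attempt < retries:
                    time.sleep(backoff * (2 ** attempt))  # exponential backoff
        else:
            # Every attempt failed: keep the output the same length as the input.
            results.append(text)
    return results


calls = {"n": 0}


def flaky(text: str) -> str:
    # Demo backend: the very first call fails, and "bad" always fails.
    calls["n"] += 1
    if text == "bad" or calls["n"] == 1:
        raise RuntimeError("temporary failure")
    return text.upper()


print(translate_with_retry(["hello", "bad"], flaky))  # ['HELLO', 'bad']
```

The fallback branch preserves the one-output-per-input contract that _translate is expected to honour.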
4. Caching
from functools import lru_cache

class CachedTranslator(BaseTranslator):
    @lru_cache(maxsize=1000)
    def translate_cached(self, text: str) -> str:
        """Cache translation results."""
        # Call the backend directly (api_call is a placeholder, as in the
        # examples above); calling self._translate() here would recurse forever.
        return self.api_call(text)

    def _translate(self, src_list: List[str]) -> List[str]:
        # Use cache for repeated translations
        return [self.translate_cached(text) for text in src_list]
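A cache keyed only on the text would return stale results after set_source or set_target changes the language pair. One way around this is to include both languages in the key, as in the sketch below. TranslationCache and its backend signature are hypothetical, chosen only to keep the example self-contained.

```python
from typing import Callable, Dict, List, Tuple


class TranslationCache:
    """Hypothetical cache keyed by (source, target, text)."""

    def __init__(self, backend: Callable[[str, str, str], str], maxsize: int = 1000):
        self._backend = backend
        self._maxsize = maxsize
        self._store: Dict[Tuple[str, str, str], str] = {}
        self.misses = 0

    def translate(self, source: str, target: str, texts: List[str]) -> List[str]:
        out = []
        for text in texts:
            key = (source, target, text)
            if key not in self._store:
                self.misses += 1
                if len(self._store) >= self._maxsize:
                    # Evict the oldest entry (dicts preserve insertion order).
                    self._store.pop(next(iter(self._store)))
                self._store[key] = self._backend(source, target, text)
            out.append(self._store[key])
        return out


cache = TranslationCache(lambda src, tgt, text: f"[{tgt}] {text}")
print(cache.translate("ja", "en", ["a", "b", "a"]))  # ['[en] a', '[en] b', '[en] a']
print(cache.translate("ja", "zh", ["a"]))            # ['[zh] a']
print(cache.misses)                                  # 3
```

Keeping the cache in a plain object rather than on a decorated method also avoids holding a reference to the translator instance inside a module-level cache.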
Registry Usage
Listing Translators
from modules.base import init_translator_registries
from modules.translators.base import TRANSLATORS
init_translator_registries()
print("Available translators:")
for name in TRANSLATORS.module_dict:
    print(f" - {name}")
Dynamic Selection
def get_translator(name: str, source: str, target: str, **params):
    """Get translator by name."""
    # module_dict maps registered names to classes, as in the listing above
    if name not in TRANSLATORS.module_dict:
        raise ValueError(f"Unknown translator: {name}")
    translator_class = TRANSLATORS.module_dict[name]
    return translator_class(
        lang_source=source,
        lang_target=target,
        **params
    )
# Usage
translator = get_translator('google', '日本語', 'English')
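The decorator-based registration used throughout this page can be illustrated with a minimal registry. This is a sketch of the general pattern, assuming only that a registry exposes a module_dict and a decorator factory; Registry, DEMO_TRANSLATORS, register_demo_translator, and EchoTranslator are hypothetical names, and the real TRANSLATORS object may differ.

```python
from typing import Callable, Dict, Type


class Registry:
    """Hypothetical minimal registry; the real one may differ."""

    def __init__(self, name: str):
        self.name = name
        self.module_dict: Dict[str, Type] = {}

    def register_module(self, key: str) -> Callable[[Type], Type]:
        def decorator(cls: Type) -> Type:
            if key in self.module_dict:
                raise KeyError(f"{key!r} is already registered in {self.name}")
            self.module_dict[key] = cls
            return cls  # return the class unchanged so it stays usable directly
        return decorator


DEMO_TRANSLATORS = Registry("translators")
register_demo_translator = DEMO_TRANSLATORS.register_module


@register_demo_translator("echo")
class EchoTranslator:
    def translate(self, text: str) -> str:
        return text


print(sorted(DEMO_TRANSLATORS.module_dict))                     # ['echo']
print(DEMO_TRANSLATORS.module_dict["echo"]().translate("hi"))  # hi
```

Rejecting duplicate keys at registration time surfaces naming collisions at import rather than at lookup.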