Multimodal Input
Karma One goes far beyond text. You can interact with AI through text, voice, images, files, and combinations of all four. This guide covers every input method available and how to get the most out of each one.
The Input Bar
The input bar is your primary interface for communicating with AI. Here is a breakdown of its components:
+---------------------------------------------------+
| [+]  Message text area             [Mic]  [Send]  |
+---------------------------------------------------+
   |                                   |      |
   |                                   |      +-- Send button
   |                                   +-- Microphone (voice input)
   +-- Attachment menu (camera, photos, files)
| Button | Function |
|---|---|
| + (Attachment) | Opens the attachment menu: camera, photo library, file upload |
| Microphone | Tap to start recording voice, tap again to stop and transcribe |
| Send | Sends the current message (text, attachments, or both) |
Text Input
Basic Text
Type your message in the input field and tap Send (or press Enter on desktop) to send it. The input field auto-expands as you type longer messages.
Markdown Support
The input field supports Markdown formatting. Use the following syntax to structure your messages:
| Syntax | Renders As | Example |
|---|---|---|
| **bold** | bold text | **important** |
| *italic* | italic text | *note* |
| `code` | inline code | `console.log()` |
| ```language | Fenced code block | Multi-line code with syntax highlighting |
| - item | Bulleted list | - first point |
| 1. item | Numbered list | 1. step one |
| > quote | Blockquote | > This is a note |
| # Heading | Heading | Section structure |
Tip: AI responses are also rendered in Markdown. You will see formatted output with headings, tables, syntax-highlighted code blocks, and more.
Keyboard Shortcuts (Desktop / Web)
| Shortcut | Action |
|---|---|
| Enter | Send message |
| Shift + Enter | New line (without sending) |
| Ctrl/Cmd + V | Paste text or images from clipboard |
| Ctrl/Cmd + Z | Undo |
Long-Form Input
The input field automatically expands to accommodate longer text. You can paste large blocks of content such as:
- Full articles for summarization or analysis
- Long paragraphs for translation
- Entire code files for review
- Multi-page meeting notes for action item extraction
There is no practical length limit on text input, though very long messages will consume more energy.
Pasting Code
When pasting code, wrap it in triple backticks for best results:
Can you review this function for bugs?
` ` `python
def calculate_discount(price, percentage):
    if percentage > 100:
        return 0
    discount = price * percentage / 100
    return price - discount
` ` `
(Spaces added between backticks above for display purposes. Use them without spaces.)
The AI will recognize the language, apply syntax highlighting in its response, and provide code-specific feedback.
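If you build prompts programmatically, the wrapping step can be automated. The following is a minimal sketch, not part of Karma One: the helper names `fence_code` and `build_review_prompt` are hypothetical. It lengthens the fence when the pasted code already contains backtick runs, so the block is not closed early.

```python
import re


def fence_code(code: str, language: str = "") -> str:
    """Wrap code in a Markdown fenced block (hypothetical helper)."""
    # Find the longest run of backticks already present in the code.
    runs = re.findall(r"`+", code)
    longest = max((len(run) for run in runs), default=0)
    # Use at least three backticks, and more than any run inside the code.
    fence = "`" * max(3, longest + 1)
    return f"{fence}{language}\n{code.rstrip()}\n{fence}"


def build_review_prompt(question: str, code: str, language: str) -> str:
    """Put the question first, then the fenced code, as in the example above."""
    return f"{question}\n{fence_code(code, language)}"


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b\n"
    print(build_review_prompt("Can you review this function for bugs?", snippet, "python"))
```

The resulting string can be pasted into the input field as a single message.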
Voice Input
Voice input lets you speak your message instead of typing. The system transcribes your speech to text automatically.
How to Use
- Tap the microphone icon on the right side of the input bar.
- Grant microphone permission if prompted (first time only).
- Start speaking naturally.
- Tap the microphone icon again (or release if using hold-to-talk) to stop recording.
- The system transcribes your speech to text.
- The transcribed text appears in the input field. Review it and tap Send.
Supported Languages
Voice input supports automatic language detection across many languages:
| Language | Recognition Quality |
|---|---|
| English | Excellent |
| Chinese (Mandarin) | Excellent |
| Japanese | Good |
| Korean | Good |
| French | Good |
| German | Good |
| Spanish | Good |
| Portuguese | Good |
| Russian | Good |
| Arabic | Good |
The system automatically detects which language you are speaking. No manual switching required. You can even mix languages within a single utterance (for example, English with occasional Chinese terms), and the system handles it correctly.
Voice Input Best Practices
Use a quiet environment. Background noise reduces transcription accuracy. In noisy settings, use a headset or earbuds with a microphone.
Speak at a natural pace. No need to slow down or speed up. Normal conversational speed works best.
Use complete sentences. The transcription engine works better with full sentences than with fragments.
Good: "Translate this paragraph into French and keep the tone formal."
Poor: "Translate... French... formal."
Combine voice and text. Speak the main content via voice, then edit the transcription in the input field before sending. This is efficient for quick drafts that need minor corrections.
Voice-to-Voice (via Telegram)
When connected to the Karma Telegram Bot, you can have a full voice-to-voice experience:
- Send a voice message to @karmabox7bot in Telegram.
- The bot transcribes your message.
- The AI generates a response.
- For short responses (under 500 characters), the bot replies with a voice message.
- For longer responses, the bot replies with text.
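The reply-format rule in the last two steps can be sketched as a single comparison. This is an illustration only: the function name is hypothetical, and it assumes the 500-character threshold is measured on the response text as a plain character count, which the guide does not specify.

```python
VOICE_REPLY_MAX_CHARS = 500  # threshold stated in the steps above


def choose_reply_mode(response_text: str) -> str:
    """Return "voice" for responses under the threshold, "text" otherwise.

    Assumption: length is a simple character count of the response text.
    """
    return "voice" if len(response_text) < VOICE_REPLY_MAX_CHARS else "text"


if __name__ == "__main__":
    print(choose_reply_mode("Sure, here is a quick answer."))
    print(choose_reply_mode("A much longer explanation. " * 40))
```

If you want a spoken reply, asking for a brief answer keeps the response under the threshold.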
Camera and Image Upload
Camera Capture
Take a photo directly from the app and send it to the AI for analysis:
- Tap the + button in the input bar.
- Select Take Photo (or Camera).
- Point and capture whatever you want the AI to analyze.
- Confirm the photo.
- Optionally add a text question in the input field.
- Tap Send.
Scenarios for camera capture:
| Scenario | Example Prompt |
|---|---|
| Identify objects | "What species of plant is this?" |
| Translate signs/menus | "Translate this restaurant menu for me" |
| Solve problems | "Help me solve this math problem" |
| Digitize handwriting | "Transcribe this handwritten note into digital text" |
| Price comparison | "How much does this product typically cost online?" |
| Receipt scanning | "Extract the date, merchant, and total from this receipt" |
| Code debugging | "What is wrong with the code on my screen?" |
Select from Photo Library
Upload existing photos from your device:
- Tap the + button.
- Select Choose from Library (or Photo Library).
- Pick one or more images.
- Add your question in the input field.
- Tap Send.
Paste Images (Desktop)
On desktop or web, copy an image to your clipboard and press Ctrl/Cmd + V to paste it directly into the input area. This works with:
- Screenshots (system screenshot tool)
- Images copied from web pages
- Images copied from other applications
- Snipped regions from screen capture tools
AI Image Understanding
Karma One supports comprehensive visual understanding:
| Capability | Description |
|---|---|
| Object recognition | Identify items, products, animals, plants in photos |
| Text extraction (OCR) | Read printed and handwritten text from images |
| Scene understanding | Describe the overall scene, context, and setting |
| Chart analysis | Interpret bar charts, line graphs, pie charts, flowcharts |
| Image comparison | Compare differences across multiple images |
| Screenshot analysis | Analyze UI screenshots, error messages, code on screen |
| Handwriting recognition | Read handwritten text, equations, and diagrams |
| Document understanding | Parse structured documents (invoices, forms, tables) |
Best practice: always pair images with specific questions.
Sending an image without context produces a generic description. Adding a targeted question produces a focused, useful answer.
[Upload a chart image]
Poor: (no text, just the image)
Better: "This chart shows our Q3 sales by region. Which region grew fastest
and what might explain the dip in July?"
Tip: For image understanding tasks, Gemini 2.5 Pro generally delivers the best multimodal performance. Consider switching to it when working heavily with visual content.
File Upload
Supported File Types
| Category | Supported Formats | Use Cases |
|---|---|---|
| Documents | PDF, DOC, DOCX | Contracts, reports, research papers |
| Presentations | PPT, PPTX | Training materials, slide decks |
| Spreadsheets | XLS, XLSX, CSV | Financial data, analytics, tables |
| Text | TXT, MD | Plain text, Markdown notes |
| Data | JSON, JSONL | Structured data, API responses, datasets |
| Code | .py, .js, .ts, .java, .go, .rs, etc. | Code review, debugging, refactoring |
| Images | PNG, JPG, JPEG, GIF, WebP, SVG, BMP | Photos, screenshots, designs |
| Audio | MP3, WAV, FLAC, AAC, OGG, M4A | Voice memos, recordings |
| Video | MP4, AVI, MOV, MKV, WebM | Video analysis, frame extraction |
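To check a batch of files before uploading, the table can be turned into a lookup by extension. This is a local sketch, not how Karma One detects file types: the names `SUPPORTED_FORMATS` and `categorize` are hypothetical, and the Code entry lists only the extensions named in the table even though the table ends with "etc."

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the supported file types table.
# The "Code" row is not exhaustive in the guide ("etc."), so neither is this set.
SUPPORTED_FORMATS = {
    "Documents": {"pdf", "doc", "docx"},
    "Presentations": {"ppt", "pptx"},
    "Spreadsheets": {"xls", "xlsx", "csv"},
    "Text": {"txt", "md"},
    "Data": {"json", "jsonl"},
    "Code": {"py", "js", "ts", "java", "go", "rs"},
    "Images": {"png", "jpg", "jpeg", "gif", "webp", "svg", "bmp"},
    "Audio": {"mp3", "wav", "flac", "aac", "ogg", "m4a"},
    "Video": {"mp4", "avi", "mov", "mkv", "webm"},
}


def categorize(filename: str) -> Optional[str]:
    """Return the table category for a filename, or None if it is not listed."""
    extension = Path(filename).suffix.lower().lstrip(".")
    for category, extensions in SUPPORTED_FORMATS.items():
        if extension in extensions:
            return category
    return None


if __name__ == "__main__":
    for name in ["Q1-sales.xlsx", "contract-v1.pdf", "app.py", "archive.zip"]:
        print(name, "->", categorize(name))
```

A `None` result only means the extension is absent from the table; it does not prove that the upload would be rejected.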
Upload Methods
Method 1: Attachment button (all platforms)
- Tap the + button in the input bar.
- Select Upload File (or Document).
- Choose a file from your file manager.
- Wait for the upload indicator to complete.
- Add your question in the text field and tap Send.
Method 2: Paste from clipboard (desktop)
Use Ctrl/Cmd + V to paste images directly from your clipboard into the input area.
Method 3: Drag and drop (desktop/web)
Drag files from your file explorer (Finder, Windows Explorer) directly into the conversation window.
Multi-File Upload
You can upload and send multiple files at once:
- In the file picker, select multiple files (hold Ctrl/Cmd to multi-select).
- Or upload files one at a time; they queue up in the input area.
- Once all files show as ready, type your question and send.
Multi-file example prompts:
[Upload: Q1-sales.xlsx, Q2-sales.xlsx, Q3-sales.xlsx]
Compare sales data across these three quarters. Identify the fastest-growing
product line and the region with the steepest decline.
[Upload: contract-v1.pdf, contract-v2.pdf]
Compare the key terms in these two contract versions. Focus on payment
conditions, liability clauses, and termination terms. Present differences
in a table.
[Upload: app.py, utils.py, models.py]
Review this Python codebase for potential security vulnerabilities and
suggest fixes.
File Processing Pipeline
Uploaded files go through several processing stages:
- Upload: File transferred to the server
- Parse: Extract text, tables, images, and structured content
- Chunk: Split long documents into manageable segments
- Ready: AI can now answer questions based on the file content
You can see real-time processing progress during upload. Once the status shows ready, send your question.
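To make the Chunk stage concrete, here is a minimal sketch of one common approach: greedily packing paragraphs into segments under a size limit. The guide does not describe Karma One's actual chunking strategy, so the function name, the paragraph-based splitting, and the `max_chars` default are all assumptions for illustration.

```python
from typing import List

# Stage names as listed in the pipeline above.
STAGES = ["Upload", "Parse", "Chunk", "Ready"]


def chunk_text(text: str, max_chars: int = 1000) -> List[str]:
    """Pack paragraphs into chunks of at most max_chars characters.

    Assumption: paragraphs are separated by blank lines. A paragraph longer
    than max_chars is cut into fixed-size pieces.
    """
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Hard-split a paragraph that cannot fit into one chunk.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
    for index, chunk in enumerate(chunk_text(sample, max_chars=40), start=1):
        print(index, repr(chunk))
```

Splitting on paragraph boundaries keeps related sentences together, which is why this sketch prefers it over cutting at a fixed character offset.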
File Size Limits
| Plan | Max File Size | Files Per Message |
|---|---|---|
| Free | 5 MB | 1 |
| Starter | 20 MB | 3 |
| Pro | 50 MB | 10 |
| Team | 100 MB | 20 |
| Enterprise | 500 MB | Unlimited |
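The limits in the table can be checked locally before sending a message. The sketch below is illustrative: `PlanLimit`, `PLAN_LIMITS`, and `check_upload` are hypothetical names, and it assumes 1 MB means 1,048,576 bytes, which the guide does not state.

```python
from dataclasses import dataclass
from typing import List, Optional

MB = 1024 * 1024  # assumption: binary megabytes


@dataclass(frozen=True)
class PlanLimit:
    max_file_mb: int
    max_files: Optional[int]  # None means unlimited


# Values copied from the file size limits table.
PLAN_LIMITS = {
    "Free": PlanLimit(5, 1),
    "Starter": PlanLimit(20, 3),
    "Pro": PlanLimit(50, 10),
    "Team": PlanLimit(100, 20),
    "Enterprise": PlanLimit(500, None),
}


def check_upload(plan: str, file_sizes_bytes: List[int]) -> List[str]:
    """Return a list of problems; an empty list means the message fits the plan."""
    limit = PLAN_LIMITS[plan]
    problems = []
    if limit.max_files is not None and len(file_sizes_bytes) > limit.max_files:
        problems.append(
            f"{len(file_sizes_bytes)} files exceeds the {plan} limit of {limit.max_files}"
        )
    for index, size in enumerate(file_sizes_bytes, start=1):
        if size > limit.max_file_mb * MB:
            problems.append(
                f"file {index} is larger than the {plan} limit of {limit.max_file_mb} MB"
            )
    return problems


if __name__ == "__main__":
    print(check_upload("Starter", [2 * MB, 25 * MB]))
```

Returning every problem at once, rather than stopping at the first, lets you fix the whole batch in one pass.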
Combining Input Types
The real power of multimodal input comes from combining different types in a single message.
Text + Image
The most common combination. Upload an image and add a text description or question.
[A product photo]
Write 3 social media captions for this product targeting students,
professionals, and families respectively.
Text + File
Upload a document and ask a specific question about it.
[An annual report PDF]
Summarize the key takeaways from this annual report. Focus on revenue
growth, profit margin changes, and forward guidance.
Text + Multiple Images
Upload several images for comparison analysis.
[Old design screenshot] [New design screenshot]
Compare the UI design between these two versions. List the major changes
and evaluate whether the redesign improves usability.
Voice + Image
Take a photo first, then describe your question using voice input. This is especially convenient when your hands are occupied or when you are on the go.
Text + Image + File
For complex analysis, combine all three:
[Product mockup image] [Requirements PDF]
Does this mockup correctly implement all the requirements in the attached
spec? List any discrepancies.
Frequently Asked Questions
How long are uploaded files retained?
Files uploaded during a conversation are retained for the duration of that conversation. You can reference previously uploaded files in subsequent messages within the same conversation. For long-term storage, add files to an avatar's knowledge base instead.
Can I upload password-protected PDFs?
No. Encrypted or password-protected PDF files cannot be parsed. Remove the password protection before uploading.
What is the difference between voice input and voice-to-voice chat?
Voice input converts your speech to text, and the AI responds in text. Voice-to-voice chat (available through the Telegram integration) is a full audio experience where the AI also responds with a voice message.
Can the AI remember images from earlier in the conversation?
Within the current conversation, the AI can reference content from previously uploaded images. However, the avatar's long-term memory stores text descriptions of image content rather than the raw image data.
Which model is best for image understanding?
Gemini 2.5 Pro is generally the strongest for visual tasks. It handles charts, documents, screenshots, and natural images well. Claude Sonnet 4 and Claude Opus 4 also support image input with good quality.
Can I use voice and text together in one message?
Yes. Start with voice input to get a transcription, then edit or append text in the input field before sending. The two modes work seamlessly together.