ChatGPT can hear and see now

OpenAI has announced multimodal capabilities for ChatGPT. The new features let the chatbot see (understand images), hear (understand speech), and speak during conversations with users.

Converse with ChatGPT

With Whisper, OpenAI’s speech recognition model, transcribing spoken input, users can hold back-and-forth conversations with ChatGPT by voice.

The text-to-speech model offers five distinct voices, created in collaboration with professional voice actors.

Engage in Conversations with Images

ChatGPT’s comprehension now extends beyond text to images, including photographs, screenshots, and documents containing text.

Users can discuss multiple images or use the new drawing tool to direct the assistant’s attention to a specific part of an image.

Additional Insights

The newly introduced text-to-speech model is already in use within Spotify’s Voice Translation feature pilot, facilitating the translation of podcast audio.

OpenAI is rolling out voice and image capabilities to Plus and Enterprise users over the next two weeks.

Voice functionality will soon be accessible on both iOS and Android, while image support will be available across all platforms.

Why It’s Significant: This multimodal release is a major step forward for language models. OpenAI has shipped it ahead of Google’s Gemini launch, and it brings us closer to the voice assistant experience many of us have long wanted from assistants like Siri.

ChatGPT can now “comprehend” uploaded images, whether they are screenshots, photographs, documents, or other visual content. This is especially useful when you are looking at something but need help understanding it.

Practical uses range from everyday tasks, such as repairing a broken bicycle, to fun ones, such as finding Waldo. What truly excites us, though, are the business applications:

  1. Interpreting Complex Graphs and Data Visualizations: ChatGPT can assist in deciphering intricate charts and data visualizations.

  2. Providing Feedback on Designs or User Experience: It can offer valuable insights and feedback on design concepts and user experiences.

  3. Categorizing Receipt and Expense Images: ChatGPT can categorize images of receipts and expenses, streamlining financial organization.

In addition to understanding images, ChatGPT can now “speak” as well. Instead of responding only in text, it can use five distinct voices created with professional voice actors. This is convenient when listening is preferable to reading, such as when you’re walking briskly on a treadmill. We also expect smarter voice assistants to find their way into work routines in the near future, potentially surpassing existing assistants like Siri.

These developments matter because they deliver the multimodal AI that OpenAI hinted at in March. With greater power comes greater responsibility, however, and OpenAI has committed to ethical use. Expect ChatGPT not only to offer more capabilities but also to exercise discretion and say “no” more often to prevent misuse of these new features.

Both of these features will be gradually rolled out to Plus and Enterprise users over the next two weeks, and you can enable them through the Settings menu under “New Features.”

