GPT-4o: Text, Voice, and Video Interaction in One Model, End to End


Today's article is selected from MIT Technology Review. In the current wave of artificial intelligence, OpenAI has once again taken the lead by releasing its new GPT-4o model. The model marks not just a technical leap but a new level of human-machine interaction: instead of juggling multiple models or tools, you can now speak or share video with the AI inside a single model and hold a natural, fluid conversation. Say what you need, and GPT-4o understands and responds immediately; show it a complex scene over video, and it can quickly analyze the situation and propose a solution. This kind of interaction is not only more efficient, it makes the AI feel like it genuinely understands you. GPT-4o's capabilities come from its design as an end-to-end multimodal large model: it accepts any combination of text, audio, and images as input and output, letting the machine grasp human needs more completely. Its response speed has also improved dramatically, approaching the pace of human conversation. We can look forward to GPT-4o showing its potential across many more fields.

OpenAI’s new GPT-4o lets people interact using voice or video in the same model



OpenAI just debuted GPT-4o, a new kind of AI model that you can communicate with in real time via live voice conversation, video streams from your phone, and text. The model is rolling out over the next few weeks and will be free for all users through both the GPT app and the web interface, according to the company. OpenAI CTO Mira Murati led the live demonstration of the new release one day before Google is expected to unveil its own AI advancements at its flagship I/O conference on Tuesday, May 14.

debut    v. to appear or be introduced to the public for the first time

demonstration    n. a presentation or display of how something works

flagship    n. the best or most important product in a group


GPT-4 offered similar capabilities, giving users multiple ways to interact with OpenAI’s AI offerings. But it siloed them in separate models, leading to longer response times and presumably higher computing costs. GPT-4o has now merged those capabilities into a single model, which Murati called an “omnimodel.” That means faster responses and smoother transitions between tasks, she said. “We’re looking at the future of interaction between ourselves and the machines,” Murati said of the demo. “We think that GPT-4o is really shifting that paradigm into the future of collaboration, where this interaction becomes much more natural.”

silo    v. to keep something in a separate, isolated system (n. a tower or pit for storing grain)

paradigm    n. a pattern, model, or way of thinking

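To make the “omnimodel” idea concrete, here is a minimal sketch of what a single-model multimodal request looks like, assuming the official openai Python SDK and an OPENAI_API_KEY in the environment; the prompt and image URL are illustrative, not from the demo. One request carries text and an image together, rather than routing each modality through a separate model.

```python
# Minimal sketch: one request to one multimodal model handles text and an
# image together, instead of chaining separate specialized models.
# Assumes the `openai` Python SDK (v1+) and OPENAI_API_KEY in the environment;
# the image URL is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is happening in this picture?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/scene.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

Because the same model handles both modalities, there is no hand-off between systems, which is where the faster responses and smoother transitions Murati describes come from.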

Like previous generations of GPT, GPT-4o will store records of users’ interactions with it, meaning the model “has a sense of continuity across all your conversations,” according to Murati. Other new highlights include live translation, the ability to search through your conversations with the model, and the power to look up information in real time. As is the nature of a live demo, there were hiccups and glitches. GPT-4o’s voice might jump in awkwardly during the conversation. It appeared to comment on one of the presenters’ outfits even though it wasn’t asked to. But it recovered well when the demonstrators told the model it had erred. It seems to be able to respond quickly and helpfully across several mediums that other models have not yet merged as effectively.

hiccups and glitches    minor faults or malfunctions

outfit    n. a set of clothes; a team or organization

demonstrator    n. a person who shows how something works

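The live-translation highlight can be loosely approximated through the text interface. The sketch below is not OpenAI's demo code: it assumes the openai Python SDK, the system prompt and sample sentence are my own, and it simply streams the output so the translation arrives incrementally, mimicking the real-time feel of the demo.

```python
# Speculative sketch of the live-translation idea via the text API: a system
# prompt sets up the interpreter role, and streaming makes the translation
# appear token by token. The prompt and example sentence are illustrative.
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are a live interpreter. Translate everything "
                       "the user says from English into Italian.",
        },
        {"role": "user", "content": "Nice to meet you. Shall we get started?"},
    ],
    stream=True,  # yield partial chunks as they are generated
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```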

Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.

latency    n. delay; the time a system takes to respond

tone    n. the quality, pitch, and manner of a voice

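The three-model pipeline described above is straightforward to picture in code. The following is a rough reconstruction, not OpenAI's actual implementation, assuming the openai Python SDK; the file names and TTS voice are illustrative. Notice that the middle model receives only plain text, which is exactly where tone, multiple speakers, and background noise get lost.

```python
# Rough sketch of the pre-GPT-4o Voice Mode pipeline: three separate models
# chained together. Assumes the `openai` Python SDK; file names and the TTS
# voice ("alloy") are illustrative.
from openai import OpenAI

client = OpenAI()

# Stage 1: a simple model transcribes audio to text.
with open("question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Stage 2: the text-only language model, the "main source of intelligence".
# Everything non-textual about the audio is already gone at this point.
completion = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = completion.choices[0].message.content

# Stage 3: a third model converts the text reply back into audio.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=answer,
)
with open("answer.mp3", "wb") as out:
    out.write(speech.read())
```

The two extra stages on either side of the language model account for much of the 2.8- to 5.4-second latency quoted above; collapsing all three into one natively multimodal model removes both hand-offs.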

END

Sentence Patterns for Writing

Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations.

scratching the surface of exploring ...    only beginning to explore ...


Translation Practice

GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. 


Article link: https://shikelang.cc/post/1261.html
