Cartesia Bets on State Space Models for Efficient AI That Runs Anywhere
2024-12-12
As AI models grow, so does the cost of building and running them, and the pursuit of cheaper approaches has become a top priority. OpenAI's AI operations costs are projected to reach $7 billion this year, and Anthropic's CEO has hinted that models costing over $10 billion could arrive soon. That has sparked a hunt for ways to make AI more affordable.

Revolutionizing AI with State Space Models

Techniques for Cost-Effective AI

Some researchers are concentrating on optimizing existing model architectures, the underlying structures that make models function. Others are developing new architectures that they believe have a better chance of scaling up affordably. Karan Goel, of the startup Cartesia, is part of the latter group. He is working on state space models (SSMs), a newer and highly efficient architecture that can handle large amounts of data at once.

Goel argues that new model architectures are necessary to build truly useful AI models. In the highly competitive AI industry, in both the commercial and open source sectors, having the best model is crucial for success. Before joining Cartesia, Goel was a Ph.D. candidate in Stanford's AI lab, where he collaborated with Christopher Ré and others. During this time, he and Albert Gu sketched out the SSM concept.

Goel then took jobs at Snorkel AI and Salesforce, while Gu became an assistant professor at Carnegie Mellon. However, they continued to study SSMs and published several significant research papers. In 2023, Gu, Goel, and their former Stanford peers Arjun Desai and Brandon Yang joined forces to launch Cartesia and commercialize their research.

Cartesia's SSM Derivatives

Cartesia, whose founding team includes Ré, has developed many derivatives of Mamba, the most popular SSM today. Gu and Princeton professor Tri Dao started Mamba as an open research project last December and have been continuously refining it. Cartesia builds on Mamba and also trains its own SSMs. Like all SSMs, Cartesia's models have a working memory, making them faster and potentially more efficient in utilizing computing power.

Most AI apps today run on models with a transformer architecture. As a transformer processes data, it adds entries to a hidden state that serves as its memory of what it has seen. That hidden state is part of what makes transformers powerful, but it is also the source of their inefficiency. To generate even a single word about a book it has just ingested, a transformer has to scan through its entire hidden state, which is as computationally demanding as rereading the whole book. In contrast, SSMs compress previous data points into a summary and update that summary as new data comes in, discarding most of the previous data.

This allows SSMs to handle large amounts of data while outperforming transformers on certain data generation tasks. With inference costs on the rise, this is a highly attractive proposition.
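The difference can be illustrated with a toy example. The sketch below is not Mamba or any Cartesia model; it is a minimal scalar linear recurrence, with hypothetical parameter values `a`, `b`, and `c` chosen only for illustration. The `FullHistoryModel` class is likewise an analogy for rescanning stored context, not real attention. Both classes produce the same outputs, but one keeps a single number of state while the other stores and rescans every past input.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RecurrentSSM:
    """Toy scalar state space model: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    a: float = 0.9  # fraction of the old summary that is kept (hypothetical value)
    b: float = 1.0  # how strongly a new input is written into the state
    c: float = 0.5  # how the state is read out
    h: float = 0.0  # the entire memory: one number, whatever the input length

    def step(self, x: float) -> float:
        # Fold the new input into the running summary, then read it out.
        self.h = self.a * self.h + self.b * x
        return self.c * self.h


@dataclass
class FullHistoryModel:
    """Same input-output map, computed by rescanning every stored input."""
    a: float = 0.9
    b: float = 1.0
    c: float = 0.5
    history: List[float] = field(default_factory=list)  # grows with every step

    def step(self, x: float) -> float:
        self.history.append(x)
        t = len(self.history) - 1
        # Work per step is proportional to the number of inputs seen so far.
        total = sum(self.a ** (t - k) * self.b * x_k
                    for k, x_k in enumerate(self.history))
        return self.c * total


def compare(inputs: List[float]) -> List[Tuple[float, float, int]]:
    """Return (ssm_output, full_history_output, stored_inputs) for each step."""
    ssm, full = RecurrentSSM(), FullHistoryModel()
    rows = []
    for x in inputs:
        rows.append((ssm.step(x), full.step(x), len(full.history)))
    return rows


if __name__ == "__main__":
    for y_ssm, y_full, stored in compare([1.0, 2.0, 3.0, 4.0]):
        print(f"ssm={y_ssm:.4f} full={y_full:.4f} stored_inputs={stored}")
```

In this sketch the recurrent version does a constant amount of work per step, while the full-history version does work proportional to everything it has seen. Real SSMs use learned, higher-dimensional states, and real transformers use attention over cached keys and values, so the example shows only the shape of the trade-off.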

Ethical Considerations

Cartesia operates as a community research lab, collaborating with external organizations and developing SSMs in-house. Sonic, its latest project, is an SSM that can clone voices and adjust their tone and cadence. Goel claims that Sonic is the fastest model in its class and that it shows SSMs excel on long-context data like audio while remaining stable and accurate.

However, Cartesia has faced ethical challenges. It trained some of its SSMs on The Pile, an open data set containing unlicensed copyrighted books. Although many AI companies argue that fair-use doctrine protects them from infringement claims, authors have sued Meta and Microsoft for using The Pile. Cartesia also has few safeguards for its Sonic-powered voice cloner. I was able to create a clone of Vice President Kamala Harris' voice using campaign speeches. Cartesia's tool requires only that users check a box indicating compliance with its terms of service.

Goel acknowledges the issue and says that Cartesia has automated and manual review systems in place and is working on voice verification and watermarking. They also have dedicated teams testing for technical performance, misuse, and bias and are establishing partnerships with external auditors for independent model verification.

Budding Business and Revenue

Goel states that hundreds of customers are paying for Sonic API access, which is Cartesia's primary source of revenue. Automated calling app Goodcall is among the customers. Cartesia's API is free for up to 100,000 characters read aloud, and the most expensive plan costs $299 per month for 8 million characters. They also offer an enterprise tier with dedicated support and custom limits.

By default, Cartesia uses customer data to train its models, which may not sit well with privacy-conscious users. However, users can opt out, and Cartesia offers custom retention policies for larger organizations. Goodcall CEO Bob Summers says he chose Sonic because its latency of 90 milliseconds was the lowest among voice generation models.

Sonic is currently used in gaming, voice dubbing, and more. Goel believes that this is just the beginning of what SSMs can do. His vision is to create models that can run on any device and understand and generate any data modality instantly. To achieve this, Cartesia launched Sonic On-Device, a version optimized for mobile devices, and Edge, a software library for optimizing SSMs for different hardware configurations, along with Rene, a compact language model.

Cartesia faces the challenge of convincing potential clients of the value of its architecture and staying ahead of competitors. Startups like Zyphra, Mistral, and AI21 Labs have trained hybrid Mamba-based models, and Liquid AI is developing its own architecture. However, Goel is confident that Cartesia, with its 26 employees and a new cash infusion, is positioned for success.

Shardul Shah of Index Ventures sees Cartesia's technology driving applications in customer service, sales and marketing, robotics, security, and more. The market demands faster and more efficient models that can run anywhere, and Cartesia's technology is well-suited to meet this demand and drive the next wave of AI innovation.
