Goel emphasizes the importance of new model architectures in building truly useful AI models. In the highly competitive AI industry, in both the commercial and open source sectors, having the best model is crucial for success. Before co-founding Cartesia, Goel was a Ph.D. candidate in Stanford's AI lab, where he collaborated with Christopher Ré and others. During this time, he and Albert Gu sketched out the concept of the state space model (SSM).
Goel then took jobs at Snorkel AI and Salesforce, while Gu became an assistant professor at Carnegie Mellon. However, they continued to study SSMs and published several significant research papers. In 2023, Gu, Goel, and their former Stanford peers Arjun Desai and Brandon Yang joined forces to launch Cartesia and commercialize their research.
Most AI apps today use transformer architectures. Transformers owe much of their power to their hidden state mechanism, but it also makes them inefficient. To generate even a single word about data it has already ingested, a transformer has to scan through its entire hidden state, which, for a model that has ingested a book, is as computationally demanding as rereading the whole book. In contrast, SSMs compress previous data points into a summary and update that state as new data comes in, discarding most previous data.
This allows SSMs to handle large amounts of data while outperforming transformers on certain data generation tasks. With inference costs on the rise, this is a highly attractive proposition.
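The contrast can be illustrated with a toy sketch in Python. This is not Cartesia's architecture or any production model: the class names `DiagonalSSM` and `GrowingCache`, the parameter values, and the uniform averaging are all assumptions made up for illustration. The first class folds each input into a fixed-size state, while the second keeps every past input and looks back over all of them at each step.

```python
from typing import List, Tuple


class DiagonalSSM:
    """Toy linear state space model with a diagonal transition (illustrative only).

    h_t = a * h_{t-1} + b * x_t   (elementwise)
    y_t = sum(c * h_t)
    """

    def __init__(self, a: List[float], b: List[float], c: List[float]) -> None:
        assert len(a) == len(b) == len(c)
        self.a, self.b, self.c = a, b, c
        # Fixed-size summary of everything seen so far.
        self.state = [0.0] * len(a)

    def step(self, x: float) -> float:
        # Fold the new input into the state; the raw input itself is not stored.
        self.state = [
            ai * hi + bi * x for ai, bi, hi in zip(self.a, self.b, self.state)
        ]
        return sum(ci * hi for ci, hi in zip(self.c, self.state))


class GrowingCache:
    """Hypothetical stand-in for memory that retains every past input."""

    def __init__(self) -> None:
        self.items: List[float] = []

    def step(self, x: float) -> float:
        self.items.append(x)
        # Each output looks back over all stored inputs (a plain average here),
        # so the work per step grows with the length of the history.
        return sum(self.items) / len(self.items)


def memory_after(n_steps: int) -> Tuple[int, int]:
    """Return (SSM state size, cache size) after processing n_steps inputs."""
    ssm = DiagonalSSM(a=[0.9, 0.5], b=[1.0, 1.0], c=[0.5, 0.5])
    cache = GrowingCache()
    for t in range(n_steps):
        x = float(t % 3)
        ssm.step(x)
        cache.step(x)
    return len(ssm.state), len(cache.items)
```

Calling `memory_after(1000)` returns a state size of 2 for the toy SSM and 1,000 stored items for the cache. The fixed-size state is what keeps the per-step cost flat, at the price of discarding detail about earlier inputs.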
However, Cartesia has faced ethical challenges. The company trained some of its SSMs on The Pile, an open data set containing unlicensed copyrighted books. Although many AI companies argue that fair-use doctrine protects them from infringement claims, authors have sued Meta and Microsoft for allegedly training models on The Pile. Cartesia also has few safeguards for its Sonic-powered voice cloner. I was able to create a clone of former vice president Kamala Harris' voice using campaign speeches. Cartesia's tool requires only that users check a box indicating compliance with its terms of service.
Goel acknowledges the issue and says that Cartesia has automated and manual review systems in place and is working on voice verification and watermarking. The company also has dedicated teams testing for technical performance, misuse, and bias, and it is establishing partnerships with external auditors for independent model verification.
By default, Cartesia uses customer data to train its models, which may not sit well with privacy-conscious users. However, users can opt out if they wish, and Cartesia offers custom retention policies for larger organizations.
Customers point to speed as a draw. Goodcall CEO Bob Summers chose Sonic because, at 90 milliseconds, its latency was lower than that of other voice generation models.
Sonic is currently used in gaming, voice dubbing, and more. Goel believes that this is just the beginning of what SSMs can do. His vision is to create models that can run on any device and understand and generate any data modality instantly. To achieve this, Cartesia launched Sonic On-Device, a version optimized for mobile devices, and Edge, a software library for optimizing SSMs for different hardware configurations, along with Rene, a compact language model.
Cartesia faces the challenge of convincing potential clients of the value of its architecture and staying ahead of competitors. Startups like Zyphra, Mistral, and AI21 Labs have trained hybrid Mamba-based models, and Liquid AI is developing its own architecture. However, Goel is confident that Cartesia, with its 26 employees and a new cash infusion, is positioned for success.
Shardul Shah of Index Ventures sees Cartesia's technology driving applications in customer service, sales and marketing, robotics, security, and more. The market demands faster and more efficient models that can run anywhere, and Cartesia's technology is well-suited to meet this demand and drive the next wave of AI innovation.