China & Generative AI

A player in the making?

陆奇最新演讲实录:我的大模型世界观.pdf (transcript of Lu Qi's latest speech: "My Worldview of Large Models")

Lu Qi's speech, summarized in English:

Lu Qi, founder and CEO of Qiji Chuangtan (MiraclePlus), admits that even he cannot keep up with the pace of the big-model era: he urges his team to track new papers and information through a "big model daily" practice, and stresses how hard it is to manage the sheer volume of papers and code in the field. He argues that the large-model era matters because of his Trinity Structure Evolution Model, in which complex digital systems are built from three interacting subsystems: information, model, and action. In his telling, the shift of information from a marginal cost to a fixed cost drove the last great wave of change in society and industry, and the large model is now the core technology and industrial foundation of the current inflection point. This revolution will touch everyone, particularly those in the service economy, and in such a world unique insight becomes the scarce asset. Lu Qi's stated aim is to help Chinese entrepreneurs understand the era of large models and find their position in it.

Lu Qi sees three inflection points ahead. The first is the ubiquity of models: within the next 15-20 years, he predicts, models will be woven into every aspect of life and reachable from any phone with an internet connection. The second is ubiquitous automated and autonomous action: as the cost of physical action shifts from marginal to fixed, robotics and spatial computing will be integrated into daily life. The third is the co-evolution of humans and digital technology, where OpenAI's framing stresses the emergence, agency, affordance, and embodiment of general intelligence. Lu Qi believes OpenAI's groundbreaking work on large models positions it to lead through these inflection points, perhaps eventually surpassing companies like Google. He walks through OpenAI's key technologies, stressing the importance of model architecture (specifically the Transformer) and praising OpenAI's alignment engineering, which works to keep models consistent with human values. He acknowledges the difficulty of building large models, above all the advanced infrastructure required and the central role of tokens in language processing, and argues that encapsulating knowledge at scale inside a model is the decisive breakthrough in natural language processing. The inflection point has arrived, and the paradigm shift is underway.
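
Since the Transformer keeps coming up as the architectural foundation, here is a minimal NumPy sketch of its core operation, scaled dot-product attention, in which every token builds its new representation as a weighted mixture of all the others. This is a generic illustration for reference only, not anything specific to OpenAI's models; the function name and toy dimensions are my own.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Core Transformer operation: each token attends to all others."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                           # pairwise token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                                        # weighted mixture of values

# Toy example: a sequence of 4 tokens, each embedded in 8 dimensions.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape)  # (4, 8): one contextualized vector per token
```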


In the future, models will be pervasive, embedded in every aspect of life. OpenAI's focus over the next 2-3 years, as Lu Qi reads it, is to make models sparser, extend the attention window, and strengthen recursive and causal reasoning, while improving model stability, the token space, latent-space alignment, and infrastructure tooling. The influx of capital has set the growth flywheel in motion, with rising investment and a maturing business and profit model. At the same time, OpenAI faces challenges around societal safety and regulation, and is deliberately slowing the pace of advancement to gather user feedback and mitigate risk. The development path centers on the extensibility of models and the model ecosystem to come: Lu Qi envisions more large models with complete world knowledge and stronger learning, generalization, and alignment capabilities, alongside domain-specific models and "human models" spanning cognitive, task, and professional dimensions. He contrasts learned models, which are adaptable and scene-based, with human models, which are professional but limited; together they will form an ecosystem of models that advances cognition and reasoning. Ultimately, grounding, the integration of perception and action, is what true intelligence requires. Lu Qi draws a parallel between the model world and the biological world, treating large models as genes that evolve and drive progress. Once the paradigm of this era is understood, the task becomes embracing and leveraging the opportunities it presents.
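
One concrete reading of "make models sparser" and "extend the attention window" is to restrict each token to a local window of attention, so that cost grows linearly rather than quadratically with sequence length. The toy mask below is purely illustrative; the speech does not say which sparsity scheme, if any, OpenAI actually uses.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Each token may only attend to itself and up to `window - 1` earlier
    positions. Sparsifying attention this way is one common route to
    stretching the usable context window."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    return (j <= i) & (i - j < window)  # causal AND local

print(sliding_window_mask(6, 3).astype(int))
# Each row has at most 3 ones: the token itself plus two predecessors.
```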


The pace of development in the big model era is incredibly fast, with frequent "holy shit" moments.


The entrepreneurial opportunities in the big model era can be understood using a structured thinking framework.







www.ft.com/content/6d6515b1-051a-40ac-b8e9-d9729241f589

“The market is no longer available for lossmaking start-ups even though it is designed for them,” said James Li, a Shenzhen-based investment banker who has worked on IPOs on the technology board.

Alibaba, Tencent and Baidu join the ChatGPT rush - Nikkei Asia.pdf
ChatGPT rush kicks U.S.-China AI race into higher gear - Nikkei Asia.pdf
Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios

Due to the complex attention mechanisms and model design, most existing vision Transformers (ViTs) can not perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g. TensorRT and CoreML. This poses a distinct challenge: Can a visual neural network be designed to infer as fast as CNNs and perform as powerful as ViTs? Recent works have tried to design CNN-Transformer hybrid architectures to address this issue, yet the overall performance of these works is far away from satisfactory. To end these, we propose a next generation vision Transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT, which dominates both CNNs and ViTs from the perspective of latency/accuracy trade-off. In this work, the Next Convolution Block (NCB) and Next Transformer Block (NTB) are respectively developed to capture local and global information with deployment-friendly mechanisms. Then, Next Hybrid Strategy (NHS) is designed to stack NCB and NTB in an efficient hybrid paradigm, which boosts performance in various downstream tasks. Extensive experiments show that Next-ViT significantly outperforms existing CNNs, ViTs and CNN-Transformer hybrid architectures with respect to the latency/accuracy trade-off across various vision tasks. On TensorRT, Next-ViT surpasses ResNet by 5.5 mAP (from 40.4 to 45.9) on COCO detection and 7.7% mIoU (from 38.8% to 46.5%) on ADE20K segmentation under similar latency. Meanwhile, it achieves comparable performance with CSWin, while the inference speed is accelerated by 3.6x. On CoreML, Next-ViT surpasses EfficientFormer by 4.6 mAP (from 42.6 to 47.2) on COCO detection and 3.5% mIoU (from 45.1% to 48.6%) on ADE20K segmentation under similar latency. Our code and models are made public at: https://github.com/bytedance/Next-ViT
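
For intuition only, here is a toy PyTorch sketch of the hybrid-stacking idea the abstract describes: a stage built mostly from cheap convolutional blocks, closed by one attention block for global mixing. The ConvBlock/AttnBlock definitions below are stand-ins of my own, not the paper's actual NCB/NTB (the real implementations live in the linked repository).

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Stand-in for an NCB-style block: cheap local feature extraction."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # depthwise
        self.pw = nn.Conv2d(dim, dim, kernel_size=1)                          # pointwise
        self.norm = nn.BatchNorm2d(dim)

    def forward(self, x):                       # x: (B, C, H, W)
        return x + self.norm(self.pw(self.dw(x)))

class AttnBlock(nn.Module):
    """Stand-in for an NTB-style block: global mixing via self-attention."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)        # (B, H*W, C) token sequence
        n = self.norm(t)
        t = t + self.attn(n, n, n)[0]
        return t.transpose(1, 2).reshape(b, c, h, w)

# NHS-style stage: several convolution blocks, one attention block at the end.
stage = nn.Sequential(ConvBlock(64), ConvBlock(64), ConvBlock(64), AttnBlock(64))
print(stage(torch.randn(2, 64, 14, 14)).shape)  # torch.Size([2, 64, 14, 14])
```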
Jointist: Simultaneous Improvement of Multi-instrument Transcription and Music Source Separation via Joint Training

In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other two modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilizes instrument information and transcription results. The joint training of the transcription and source separation modules serves to improve the performance of both tasks. The instrument module is optional and can be directly controlled by human users. This makes Jointist a flexible user-controllable framework. Our challenging problem formulation makes the model highly useful in the real world given that modern popular music typically consists of multiple instruments. Its novelty, however, necessitates a new perspective on how to evaluate such a model. In our experiments, we assess the proposed model from various aspects, providing a new evaluation perspective for multi-instrument transcription. Our subjective listening study shows that Jointist achieves state-of-the-art performance on popular music, outperforming existing multi-instrument transcription models such as MT3. We conducted experiments on several downstream tasks and found that the proposed method improved transcription by more than 1 percentage points (ppt.), source separation by 5 SDR, downbeat detection by 1.8 ppt., chord recognition by 1.4 ppt., and key estimation by 1.4 ppt., when utilizing transcription results obtained from Jointist. Demo available at https://jointist.github.io/Demo
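
A toy sketch of the conditioning-and-joint-training idea from the abstract: an instrument recognizer whose output conditions both a transcription head and a separation head, trained on a single summed loss so gradients from each task shape the shared features. All module shapes and targets below are made up for illustration; Jointist's real modules operate on audio spectrograms (see the demo page).

```python
import torch
import torch.nn as nn

class JointModel(nn.Module):
    """Toy version of Jointist's structure: instrument logits condition
    both the transcription and the separation heads."""
    def __init__(self, n_instruments=8, feat_dim=128):
        super().__init__()
        self.recognizer = nn.Linear(feat_dim, n_instruments)
        self.transcriber = nn.Linear(feat_dim + n_instruments, 88)        # piano-roll bins
        self.separator = nn.Linear(feat_dim + n_instruments, feat_dim)    # per-frame mask/feats

    def forward(self, feats):                               # feats: (B, T, feat_dim)
        inst = self.recognizer(feats.mean(dim=1))           # clip-level instrument logits
        cond = inst.sigmoid().unsqueeze(1).expand(-1, feats.size(1), -1)
        x = torch.cat([feats, cond], dim=-1)                # condition both heads
        return inst, self.transcriber(x), self.separator(x)

model = JointModel()
feats = torch.randn(4, 100, 128)                            # dummy audio features
inst, roll, sep = model(feats)

# Joint training: one backward pass over the sum of both task losses
# (dummy targets here), so separation gradients also improve transcription.
loss = nn.functional.binary_cross_entropy_with_logits(roll, torch.zeros_like(roll)) \
     + nn.functional.mse_loss(sep, feats)
loss.backward()
```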