Text-based data has long been a valuable source of alpha for both discretionary and systematic traders, but a persistent tension has existed between the two camps.
Humans have a natural advantage when it comes to interpreting nuanced language, while machines can process information at speeds that are simply beyond human capacity. Research from LSEG Data & Analytics highlights how recent advances in natural language processing (NLP) and high-performance computing are not only widening the processing speed advantage held by systematic traders, but also dramatically narrowing the gap between human and machine comprehension of text data.
The pace of progress in deep learning has been nothing short of remarkable. Artificial, convolutional, and recurrent neural networks have become standard tools in many high-performing funds, consistently outpacing traditional statistical and machine learning approaches. At a broader level, large language models (LLMs) have moved to the forefront across a wide range of industries. Generative models — with GPT leading the charge — are being rolled out at an unprecedented rate, while discriminative LLMs have made equally significant, if less publicised, strides.
Google’s BERT model has established itself as a leading transformer architecture for sentiment analysis. According to LSEG Data & Analytics, BERT has demonstrated double-digit performance improvements over previous state-of-the-art models on the General Language Understanding Evaluation (GLUE) benchmark — a significant leap that opens new doors for systematic traders seeking to extract greater value from text data.
Crucially, beyond its strong baseline performance, BERT can be fine-tuned using relatively modest amounts of labelled data, making it highly adaptable to specialised domains with technical jargon or non-standard language use.
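To illustrate what that fine-tuning step can look like in practice, the sketch below adapts a pre-trained BERT checkpoint to a small set of labelled financial sentences using the Hugging Face Transformers Trainer API. The checkpoint name, the three-class label scheme and the toy examples are assumptions made for illustration; the research does not specify a particular setup.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical labelled examples: 0 = negative, 1 = neutral, 2 = positive.
examples = {
    "text": [
        "Q3 revenue beat guidance on strong subscription growth",
        "Auditors flag covenant breach risk in interim accounts",
    ],
    "label": [2, 0],
}

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3
)

# Tokenise the labelled sentences into fixed-length inputs for the model.
dataset = Dataset.from_dict(examples).map(
    lambda batch: tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=64
    ),
    batched=True,
)

# A few epochs over a modest labelled set is typically enough to adapt the
# pre-trained weights to domain-specific language.
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finetuned-bert",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=dataset,
)
trainer.train()
```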
From an execution standpoint, accessibility is one of BERT’s most compelling attributes. Training an LLM from scratch is an enormous undertaking — the base BERT model alone carries 110 million trainable parameters, requiring vast quantities of data to develop effectively. However, because many of these models are open-sourced, traders and institutions can instead focus on fine-tuning pre-existing models for their specific use cases, considerably lowering the barrier to adoption.
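The snippet below makes the scale of that undertaking concrete: it loads an open-source BERT-base checkpoint and counts its trainable parameters, which comes out at roughly the 110 million figure cited above. The model id is an assumption standing in for whichever open-source checkpoint a desk would start from.

```python
from transformers import AutoModelForSequenceClassification

# Assumption: "bert-base-uncased" stands in for an open-source BERT-base
# checkpoint; the research does not name a specific one.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3
)

# BERT base carries on the order of 110 million trainable parameters, which is
# why training from scratch is impractical for most firms while fine-tuning a
# published checkpoint is not.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable / 1e6:.0f}M")
```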
LSEG Data & Analytics points to the practicalities of deploying these models in live data pipelines. Using Hugging Face’s FinBERT, a single CPU thread running at 2.3 GHz can process approximately 20 pieces of text per second in a base configuration. Switching to a faster tokeniser — achievable with a one-line code change in Python — yields roughly a 74% improvement in throughput. Moving to GPU infrastructure takes this further still: a 9.1 TFLOPS GPU pushes processing capacity to around 261 predictions per second, more than a tenfold increase over the CPU baseline. Cloud providers today offer compute power that exceeds even this level of performance.
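Actual throughput will depend on hardware, sequence lengths and batch sizes, but the tokeniser swap and the CPU-to-GPU move can be sketched roughly as follows. The research does not name the exact checkpoint or code; ProsusAI/finbert is a commonly used FinBERT model on the Hugging Face Hub and is assumed here for illustration.

```python
import time
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_name = "ProsusAI/finbert"  # assumed FinBERT checkpoint

# use_fast=True selects the Rust-backed tokenizer; use_fast=False falls back to
# the slower pure-Python implementation -- the one-line change mentioned above.
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# device=-1 runs inference on CPU; device=0 would target the first GPU instead.
clf = pipeline("text-classification", model=model, tokenizer=tokenizer, device=-1)

headlines = ["Shares slump as profit warning spooks investors"] * 200

# Rough throughput measurement over a batch of short headlines.
start = time.perf_counter()
clf(headlines, batch_size=32)
elapsed = time.perf_counter() - start
print(f"{len(headlines) / elapsed:.0f} predictions per second")
```

Re-running the same measurement with `use_fast=False` and then with `device=0` on a GPU gives a feel for the tokeniser and hardware effects the research quantifies.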
For more insights, read the full story here.
Copyright © 2026 FinTech Global









