28 Sep, 2022

An introduction to AI-driven portfolio construction


In this blog, our summer intern, Luisa Jung, provides a high-level introduction to how we incorporate AI into the construction of investment portfolios on our technology platform, AutoCIO, which delivers active, customisable investment strategies at the click of a button.

Having grown up with technology for as long as I can remember, I have always found Artificial Intelligence (AI) an abstract and, admittedly, quite intimidating concept. However, my confusion around AI disappeared within my first few weeks at Arabesque. I spoke with the Researchers and Engineers on the AI team and soon had my ‘lightbulb moment’ when I realised that it was the obvious next step for the AI industry to apply itself to financial markets, given their incredibly complex and interconnected nature.

Our web application, AutoCIO, enables clients to create active, hyper-customised, and sustainable investment strategies by choosing from hundreds of user-defined parameters, making strategy creation extremely scalable and cost-effective. As you read this blog post, I hope to demystify our use of AI in the strategy construction process and contribute to your understanding of AutoCIO.

What is the AI Engine?

AutoCIO is powered by our AI Engine, which analyses fundamental behaviours and structures of financial markets and evaluates over 25,000 stocks daily, producing an expectation of return for each. This creates a data foundation that enables the potential output of millions of investment strategies. At its core, the AI Engine consists of an ensemble of machine learning models that analyse a vast amount of financial and non-financial data to distinguish how this information may impact the future return of an asset. (Please read the dedicated blog post from September 2021 to obtain a more detailed explanation.)

Figure 1: AI Engine Input Dataset examples

Portfolio Creation with AutoCIO

Reducing management fees to stay competitive whilst addressing changing client needs is a major challenge for the asset management industry. To combat the challenges the industry faces, we embrace the application of AI in the portfolio construction process. AutoCIO has the capability of creating traditional, active investment strategies at an enormous scale, making the product agile to the needs of portfolio managers. Millions of potential investment strategies can be simulated in less than two hours for a fraction of the normal research cost. The AI Engine ensures that AutoCIO is continuously adapting to new market trends and behaviours.

Let’s now look at how AutoCIO works: once the AI Engine has analysed the big data input and calculated an expectation of stock return for all the 25,000+ assets, AutoCIO then implements the portfolio construction process. This process consists of three interlinked classes:¹ Rebalance, Optimiser and Backtester.
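As a rough illustration of how three such classes might fit together (all names, filters and logic here are invented for this sketch, not AutoCIO's actual code):

```python
# Toy sketch of a Rebalance -> Optimiser -> Backtester pipeline.
# All class names, fields and logic are illustrative only.

class Rebalance:
    """Builds the investible universe from raw asset data."""
    def __init__(self, min_liquidity):
        self.min_liquidity = min_liquidity

    def universe(self, assets):
        # Keep only assets passing a minimal liquidity criterion.
        return [a for a in assets if a["liquidity"] >= self.min_liquidity]

class Optimiser:
    """Chooses portfolio weights (here: simply proportional to alpha)."""
    def weights(self, universe):
        total = sum(a["alpha"] for a in universe)
        return {a["ticker"]: a["alpha"] / total for a in universe}

class Backtester:
    """Applies the weights to one period of historical returns."""
    def run(self, weights, returns):
        return sum(w * returns[t] for t, w in weights.items())

assets = [
    {"ticker": "AAA", "alpha": 0.6, "liquidity": 10},
    {"ticker": "BBB", "alpha": 0.4, "liquidity": 1},   # fails liquidity filter
    {"ticker": "CCC", "alpha": 0.4, "liquidity": 8},
]
universe = Rebalance(min_liquidity=5).universe(assets)
weights = Optimiser().weights(universe)
result = Backtester().run(weights, {"AAA": 0.02, "CCC": -0.01})
```

The point of the sketch is the data flow: the Rebalance class hands a cleaned universe to the Optimiser, whose weights feed the Backtester.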

Figure 2: AutoCIO Strategy Construction 


AutoCIO portfolio construction begins with the Rebalance class, which is essentially the creation of an investible universe – a group of stocks considered investible, meaning they satisfy a minimal set of conditions or criteria such as liquidity and size. How do we create a universe? First, market filters are applied:² besides selecting the geography (i.e. adding or removing individual countries from the strategic universe), we can include or exclude certain sectors or industries.

Now that we have created the skeleton of the strategy, we can apply additional filters to further define the universe of stocks. For example, we could filter on liquidity, define the exposure of the strategy by setting the market cap, and/or filter on stock activity. Lastly, we may exclude certain stocks by implementing ESG or other style preferences.
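Conceptually, each filter is just a successive narrowing of a stock list. A toy sketch, with invented field names and thresholds:

```python
# Illustrative universe construction: each filter narrows the stock list.
# Field names, thresholds and data are invented for this sketch.

def build_universe(stocks, countries, excluded_sectors, min_mcap, esg_exclusions):
    universe = [s for s in stocks if s["country"] in countries]           # geography
    universe = [s for s in universe if s["sector"] not in excluded_sectors]  # sector
    universe = [s for s in universe if s["market_cap"] >= min_mcap]       # size
    universe = [s for s in universe if s["ticker"] not in esg_exclusions] # ESG/style
    return universe

stocks = [
    {"ticker": "A", "country": "DE", "sector": "Tech",    "market_cap": 50},
    {"ticker": "B", "country": "US", "sector": "Tobacco", "market_cap": 80},
    {"ticker": "C", "country": "US", "sector": "Tech",    "market_cap": 5},
    {"ticker": "D", "country": "US", "sector": "Tech",    "market_cap": 120},
]
universe = build_universe(stocks, {"US"}, {"Tobacco"}, 10, {"XYZ"})
```

Here only "D" survives: "A" fails the geography filter, "B" the sector exclusion, and "C" the market-cap floor.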

By selecting these filters, we narrow down the set of stocks that can then be further optimised in subsequent steps. Besides constructing the strategic universe, the Rebalance class acts as an interface, or a sort of ‘middleman’, between the Optimiser and the Backtester classes, cleaning the data it receives from both sides.


Within the Optimiser class, we determine the optimal portfolio within the investible universe at each rebalancing. The objective of the Optimiser class is to choose a portfolio of stocks that maximises a function which considers returns, transaction costs, and an investor’s risk profile, subject to a number of constraints. The first component of the function is the portfolio’s alpha assessment – determined by the AI Engine. The second and third components are portfolio risk and cost appetite, respectively. Risk in AutoCIO is calculated as portfolio variance. For costs, a combination of parameters including trading commission, bid-ask spread, and market impact costs is considered. The coefficients of the equation for risk and cost aversion are negative, to reflect that they detract from the alpha component of the portfolio being maximised. The weights of all the stocks in the portfolio (referred to as wi in the equation) correspond with the forecasted alpha, risk and cost components of the function.
Lastly, the Optimiser implements constraints; for example, we may implement a strategy constraint, like long-only (wi ≥ 0), or a risk constraint, like maximum position size (wi ≤ max. position size). The magnitude of the coefficients for each of the terms within the function, as well as the accompanying constraints, are fully customisable to ensure that each investor’s objective is being met.

The following equation explains the goal of the Optimiser: 
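In symbols, and hedging that the exact functional form is our guess (λ and γ are illustrative risk- and cost-aversion coefficients, Σ the covariance matrix, α the vector of AI Engine alpha forecasts, and c(w) a transaction-cost term), the objective described above would take a form like:

```latex
\max_{w}\;\; \underbrace{\alpha^{\top} w}_{\text{alpha}}
\;-\; \underbrace{\lambda\, w^{\top} \Sigma\, w}_{\text{risk aversion}}
\;-\; \underbrace{\gamma\, c(w)}_{\text{cost aversion}}
\qquad \text{s.t.}\quad w_i \ge 0,\;\; w_i \le w_{\max}
```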


Once the strategy has been constructed, the Backtester simulates the trades. The Backtester runs a historical simulation, replicating how the strategy would have performed. AutoCIO can run a wide array of strategies at different frequencies in different regions and markets. Once the backtesting is complete, the strategy is available to clients via the AutoCIO web application. In addition to the client’s customised portfolio strategy, the user interface also offers the possibility to view AutoCIO’s off-the-shelf strategies. The web application allows clients to analyse a strategy using statistics and factor attribution.
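At its core, any backtest is a loop that applies the chosen weights to each historical period's returns. A minimal sketch with toy data and invented names (a production backtester would also model rebalancing, transaction costs and corporate actions):

```python
# Minimal backtest sketch: hold fixed weights and compound period returns.
# Data and names are illustrative, not AutoCIO's implementation.

def backtest(weights, period_returns):
    value = 1.0
    for period in period_returns:  # each period maps ticker -> return
        portfolio_return = sum(w * period[t] for t, w in weights.items())
        value *= 1.0 + portfolio_return
    return value

weights = {"AAA": 0.5, "BBB": 0.5}
history = [
    {"AAA": 0.10, "BBB": 0.00},   # period 1: portfolio return +5%
    {"AAA": -0.05, "BBB": 0.05},  # period 2: portfolio return 0%
]
final_value = backtest(weights, history)
```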

Figure 3: Interaction between the classes

Wrapping Up

I hope this blog post has made it clear that AI can be a solution to three key challenges faced by the asset management industry: customisation, cost, and performance. AI enables and facilitates the customisation process by quickly implementing preferences, reducing effort and time. This leads us to the next challenge which AutoCIO solves – cost. Traditional, active investment strategies are normally cost intensive, as they require an ‘army’ of analysts to conduct significant research and due diligence. AutoCIO allows investors to bypass these lengthy and costly processes. Lastly, performance is accounted for by our AI Engine, which utilises a vast array of data and the latest advancements in machine learning research to find return opportunities in markets. As our research and the review of our created strategies have shown, strategies created on AutoCIO generally outperform their respective benchmarks over 80% of the time, and with less volatility.³

AutoCIO provides an opportunity for institutional investors to generate funds at scale with the computational power to analyse high volumes of financial and non-financial data to service a growing demand for customisation and values-based strategies. It’s no longer a case of ‘if’ or ‘when’ AI will enter investment functions. The opportunity for autonomous investing is here now.

Here are some key takeaways about AutoCIO:

  1. Arabesque’s AI Engine produces an expected return assessment for over 25,000 stocks on a daily basis.
  2. Rebalancing allows AutoCIO to consider a large range of preferences and exclusion criteria to construct an eligible universe of stocks.
  3. The Optimiser, based on the AI Engine outputs received, optimises the portfolio from the rebalanced universe to achieve a client’s investment objective.
  4. The Backtester simulates the trades, running a historical simulation of how the strategy would have performed.
  5. The use of AI in asset management allows for a more comprehensive analysis of investment strategies and is more agile to consumer/market demands.

¹ In coding, many languages use the concept of objects and classes. Objects are a construct enclosing data and functions. Classes provide the structure for building these objects.

² Instead of selecting from a range of optional filters, it is also possible to select a particular predefined benchmark.

³ 1,200 funds generated by the autonomous asset management platform AutoCIO delivered an average of 1.86 percentage points (pp) in excess returns and a 2.89pp reduction in volatility compared with equivalent benchmarks over a ten-year period. 87% of these portfolios outperformed their benchmarks. This effect, while stronger in some regions, holds across all regions.

27 Jul, 2022

An Eye-Opening Experience: Reflections from the Conference on Computer Vision and Pattern Recognition


This year saw the return of in-person research conferences, and the IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR) was no exception. The event took place in New Orleans, Louisiana from June 19th – 24th (Figure 1). It hosted thousands of researchers from leading academic and industrial institutions (think Google, Meta, Apple, Tesla etc.), and I was fortunate enough to attend the event myself.  
Over the years, computer vision has made an immense contribution to Machine Learning (ML) and Artificial Intelligence (AI): from feature detection [1] (analysing an image and decomposing it into “simpler” features, e.g., edges or shapes) and Convolutional Neural Networks [2] (neural networks that exploit correlations, usually spatial, between “nearby” points in an input by applying convolutional filters across the input space), to autonomous driving [3] (self-driving vehicles). 
With this in mind and considering that CVPR has the highest h-index (a measure of the impact of a conference, based on some function of citation counts) of any AI-related conference, there’s no doubt that the output of this year’s event will move AI forward. In this blog, I highlight what I think are some of the most interesting papers and ideas to come out of this year’s event and their potential implications for AI and ML in general. 

Figure 1: CVPR in the flesh! 

Hyperbolic deep learning 

In [4], hyperbolic deep learning is used to classify what will happen next in a video clip, given the clip’s preceding frames.  
The details of how the training of a machine learning model is embedded in a hyperbolic geometry are far beyond the scope of this blog post. However, the essence is that the model attempts to learn a hierarchy of the possible classifications of a problem (outputs, that is, what the next video frame will depict) and uses the curved surface (Figure 2) of a hyperbolic space to represent these hierarchies and to quantify the uncertainty of a classification. Different levels of the hierarchy represent the certainty of the model’s prediction. 
What this means for AI in general: The idea of training predictive models on a hyperbolic space is completely general to any supervised machine learning task (scenarios where ML is used to generate predictions), and so it will be interesting to see how the concept plays out in other fields. 
On a side note, it’s great to see geometry playing a larger part in AI, akin to how Einstein began to form his theory of General Relativity, based on curved spaces and geodesics. 
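For a feel of the geometry, the distance function of the Poincaré ball, one standard model of hyperbolic space, fits in a few lines. This is the textbook formula, not the specific construction used in [4]:

```python
import math

def poincare_distance(u, v):
    # Distance between two points strictly inside the unit ball, in the
    # Poincare-ball model: acosh(1 + 2|u-v|^2 / ((1-|u|^2)(1-|v|^2))).
    norm_u2 = sum(x * x for x in u)
    norm_v2 = sum(x * x for x in v)
    diff2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * diff2 / ((1.0 - norm_u2) * (1.0 - norm_v2)))

# The same Euclidean gap counts as "further apart" hyperbolically near the
# boundary of the ball - which is what makes the space useful for hierarchies:
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_edge = poincare_distance([0.8, 0.0], [0.9, 0.0])
```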

Figure 2: Representing prediction uncertainty on a hyperbolic space. Image taken from the original paper, with permission 

Explainable AI 

The fact that explainable AI received a large amount of coverage at CVPR is somewhat telling: it is an area of growing popularity and importance in CV. 
The gist of the papers is that the output of a classification algorithm is explained in terms of the inputs to the predictive model (usually a CNN). Of course, in most contexts applicable to these papers, the inputs are images, and so the explanations are formed around “groups” of the inputs, pertaining to features detected in the images by the model (e.g. whiskers on a cat). Unsurprisingly, another common theme in the explainability papers was the notion of attention [5], which gives some indication of which inputs “attained more attention” in generating the different outputs. Furthermore, the fact that more ready-to-use open-source tools are being made available, rather than just being presented as proofs of concept, is a breath of fresh air, and hopefully takes some of the weight off the few existing easily-deployable packages such as LIME [6] and SHAP [7]. 
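To make the "explanations formed around groups of inputs" idea concrete, here is a generic occlusion-style attribution sketch, not taken from any of the CVPR papers: mask one patch of the input at a time and record how much the model's score drops.

```python
# Occlusion-style attribution: the importance of an input region is the
# drop in model score when that region is zeroed out. The "model" below
# is a toy scoring function, purely for illustration.

def model_score(image):
    # Toy model: responds only to the top-left 2x2 quadrant.
    return sum(sum(row[:2]) for row in image[:2])

def occlusion_attribution(image, model, patch=2):
    base = model(image)
    scores = {}
    n = len(image)
    for i in range(0, n, patch):
        for j in range(0, n, patch):
            masked = [row[:] for row in image]      # copy the image
            for r in range(i, i + patch):
                for c in range(j, j + patch):
                    masked[r][c] = 0.0              # occlude one patch
            scores[(i, j)] = base - model(masked)   # score drop = importance
    return scores

image = [[1.0] * 4 for _ in range(4)]
attr = occlusion_attribution(image, model_score)
```

Only the top-left patch receives a non-zero attribution, matching the only region this toy model actually "looks at".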
One paper in particular that caught my eye was [8]. In this paper, instead of trying to explain a model’s prediction using some approximate method after the prediction has been generated, the network’s mathematical formulation is adjusted a priori so that a by-product of the output of the model provides a direct explanation for the prediction in terms of the model’s inputs.  

Figure 3: Dissecting the informative parts of an input image using [8]. Image taken from the original paper, with permission 

What this means for AI in general: In a field where explanations were not considered pivotal for a long time, it is refreshing to see the topic start to take more of a leading role. Especially as computer vision is applied more and more to real-world scenarios such as medical imaging, and self-driving vehicles, it is crucial that we understand why the blackbox algorithms make the decisions they do. I am confident that this further adoption of explainable AI will continue across different fields in the short-term. 

Domain generalisation 

Another hot topic in CV currently is domain generalisation. It is well known that deep learning models perform well in live production when the examples encountered are drawn from the same data distribution the models were trained on. However, when the models are applied to data distributions different from those they were trained on, they often don’t perform as well, as the “patterns” they learned during training don’t persist for the out-of-distribution data.  
Some interesting papers from CVPR which attempt to address this issue include [9]. In this paper, with the aid of prior knowledge, causal features are learned which are considered to be invariant even across different data distributions. For example, when trying to classify images of different animals, the causal invariant feature would be the profile of the animal, irrespective of the form of the background, which is instead treated as a spurious feature. 

Figure 4: Causal invariant features (profile of the cow) versus spurious features (image background). Image taken from the original paper, with permission 

 Another paper on this topic from CVPR is [10]. The paper adapts a boosting technique (fitting models to the residuals of the errors of other models’ predictions), applied as an add-on to any deep learning model (Figure 5). In order to improve generalisability, the boosted models are applied to train/cross-validation splits of the data during model training, and to a subset of both the most informative and least informative internal features of the original neural network. During inference, the boosted model used is chosen based on the discrepancy of the input data with the different classes of data used during training. This is measured using a Siamese network [11] (a deep learning model which measures the discrepancy between two sets of inputs). The concept of boosting extends as far back as [12] and is prominent in decision tree-based models [13]. 
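Boosting itself can be shown in miniature: a second model is fit to the residuals of a first, and their predictions are summed. This toy sketch only illustrates the mechanics, not BoosterNet's architecture:

```python
# Boosting in miniature: fit a weak base model, then fit a second model
# to the base model's residuals and sum the two predictions.
# Both "models" here are deliberately trivial.

def fit_mean(xs, ys):
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line_through_origin(xs, ys):
    # Least-squares slope for y ~ b * x.
    b = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return lambda x: b * x

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]                        # roughly y = 2x

base = fit_mean(xs, ys)                           # weak base model
residuals = [y - base(x) for x, y in zip(xs, ys)]
booster = fit_line_through_origin(xs, residuals)  # model of the residuals

boosted = lambda x: base(x) + booster(x)
err_base = sum((y - base(x)) ** 2 for x, y in zip(xs, ys))
err_boosted = sum((y - boosted(x)) ** 2 for x, y in zip(xs, ys))
```

Because the booster is fit by least squares on the residuals (and "do nothing" is always a feasible fit), the boosted squared error can never exceed the base model's.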

Figure 5: Schematic of the BoosterNet add-on for a deep learning model. Image taken from the original paper, with permission 

What this means for AI in general: More focus on domain generalisation is crucial for any scenarios subject to rapidly changing environments, including finance, autonomous driving or weather/natural disaster predicting. 

Tail, few-shot and imbalanced data learning 

Imbalanced datasets (usually pertaining to the target outputs of the model) are a prominent theme in CV. Think, for example, of performing facial recognition on an individual based on a model trained with only one photo of that individual. Another example, in the context of image classification, is a classifier that has been trained on a particular set of animals to classify cats and dogs, but then needs to be used to classify another class of animal (e.g. horses). This extreme case of data imbalance is referred to as zero-shot learning [14]. 
On this theme, an interesting paper from CVPR is [15]. In this paper, a method is proposed for optimally selecting pre-trained models to be applied to a new dataset, even if the new data contains classes of images not contained in the pre-trained models, or vice versa. Since fine-tuning all potential pre-trained models on the new data is time-consuming, the paper derives a transferability metric (Figure 6) – an indicator of a model’s performance on the new dataset, which can be calculated by passing the new training data through the candidate models just once. The paper shows that this transferability metric correlates well with true performance on the new test datasets. The final selection of models, chosen by their transferability metric values, is then fine-tuned (trained) on the training data of the new dataset, and final predictions are generated by ensembling the predictions of these models.

Figure 6: Overview of how [15] proposes to select from an array of pre-trained models using a transferability metric, without having to explicitly train these models on new training data. Image taken from the original paper, with permission 

[16] addresses the problem of imbalanced data in a regression (continuous output) setting (e.g. trying to predict human height). Here, the typical loss function used in regression settings (mean squared error) is given a full statistical treatment to derive an adapted loss function which accommodates any imbalance in the data used during model training. The method carries a small additional computational overhead from the Monte Carlo sampling associated with computing what is known in Bayesian statistics as the marginal likelihood/evidence. 
A simple but effective method of dealing with imbalanced datasets is presented in [17]. The authors argue that the weights associated with the last layer of a classification model are imbalanced in norm according to the associated class imbalances (Figure 7). That is, the norm of the weights associated with the components of the probability outputs, in turn associated with classes which are highly represented in the training data, are much larger than those of the weights associated with underrepresented classes. The authors go on to show that simple regularisation of these weights associated with the output layer of the classification model can counter the class imbalance and lead to better performance on test sets which are not subject to such imbalanced data. 
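The essence of the fix can be sketched in a few lines. Note that the paper balances the norms during training via weight decay and norm constraints; the post-hoc row normalisation below is a deliberate simplification:

```python
import math

# Post-hoc weight-balancing sketch: equalise the per-class norms of a
# classifier's output layer. (The paper regularises during training;
# this after-the-fact normalisation only illustrates the idea.)

def balance_rows(weight_matrix):
    balanced = []
    for row in weight_matrix:            # one row per output class
        norm = math.sqrt(sum(w * w for w in row))
        balanced.append([w / norm for w in row])
    return balanced

# Rows for frequent ("head") classes tend to have much larger norms
# than rows for rare ("tail") classes after imbalanced training:
W = [
    [3.0, 4.0],    # frequent class: norm 5.0
    [0.3, 0.4],    # rare class: norm 0.5
]
W_balanced = balance_rows(W)
norms = [math.sqrt(sum(w * w for w in row)) for row in W_balanced]
```

After balancing, every class contributes on an equal footing to the softmax logits.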

Figure 7: Evolution of the norm (colour of the heatmap) of the model weights associated with different classes in the data. For each plot, the rarer a class is, the higher its positioning on the vertical axis. Image taken from the original paper, with permission  

What this means for AI in general: Similar to domain generalisation, research into imbalanced data regimes is crucial for any context which is prone to black swan events. 

Final word 

One final note from the conference was the rise of transformers [18] in CV. It was interesting to see that they are now very much considered the state-of-the-art deep learning model when applied to CV [19], and a lot of active research is being put into vision transformers [20, 21, 22]. However, it is good to see that work on CNNs in CV is still going on [23], and the two are even being combined [24]. This makes sense given the original transformer model uses something akin to a convolutional filter during one of its operations. 


[1] – Lindeberg, T., 1998. Feature detection with automatic scale selection. International journal of computer vision, 30(2), pp.79-116. 
[2] – LeCun, Y. and Bengio, Y., 1995. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10), p.1995. 
[3] – Levinson, J., Askeland, J., Becker, J., Dolson, J., Held, D., Kammel, S., Kolter, J.Z., Langer, D., Pink, O., Pratt, V. and Sokolsky, M., 2011, June. Towards fully autonomous driving: Systems and algorithms. In 2011 IEEE intelligent vehicles symposium (IV) (pp. 163-168). IEEE. 
[4] – Surís, D., Liu, R. and Vondrick, C., 2021. Learning the predictability of the future. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12607-12617). 
[5] – Bahdanau, D., Cho, K. and Bengio, Y., 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. 
[6] – Ribeiro, M.T., Singh, S. and Guestrin, C., 2016, August. “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 1135-1144). 
[7] – Lundberg, S.M. and Lee, S.I., 2017. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30. 
[8] – Böhle, M., Fritz, M. and Schiele, B., 2022. B-cos Networks: Alignment is All We Need for Interpretability. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10329-10338). 
[9] – Wang, R., Yi, M., Chen, Z. and Zhu, S., 2022. Out-of-distribution Generalization with Causal Invariant Transformations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 375-385). 
[10] – Bayasi, N., Hamarneh, G. and Garbi, R., 2022. BoosterNet: Improving Domain Generalization of Deep Neural Nets Using Culpability-Ranked Features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 538-548). 
[11] – Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A. and Torr, P.H., 2016, October. Fully-convolutional siamese networks for object tracking. In European conference on computer vision (pp. 850-865). Springer, Cham. 
[12] – Drucker, H., Cortes, C., Jackel, L.D., LeCun, Y. and Vapnik, V., 1994. Boosting and other ensemble methods. Neural Computation, 6(6), pp.1289-1301. 
[13] – Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J., 2017. Classification and regression trees. Routledge. 
[14] – Socher, R., Ganjoo, M., Manning, C.D. and Ng, A., 2013. Zero-shot learning through cross-modal transfer. Advances in neural information processing systems, 26. 
[15] – Agostinelli, A., Uijlings, J., Mensink, T. and Ferrari, V., 2022. Transferability Metrics for Selecting Source Model Ensembles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7936-7946). 
[16] – Ren, J., Zhang, M., Yu, C. and Liu, Z., 2022. Balanced MSE for Imbalanced Visual Regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7926-7935). 
[17] – Alshammari, S., Wang, Y.X., Ramanan, D. and Kong, S., 2022. Long-tailed recognition via weight balancing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6897-6907). 
[18] – Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. Advances in neural information processing systems, 30. 
[19] – Khan, S., Naseer, M., Hayat, M., Zamir, S.W., Khan, F.S. and Shah, M., 2021. Transformers in vision: A survey. ACM Computing Surveys (CSUR). 
[20] – Sun, T., Lu, C., Zhang, T. and Ling, H., 2022. Safe Self-Refinement for Transformer-based Domain Adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7191-7200). 
[21] – Lin, S., Xie, H., Wang, B., Yu, K., Chang, X., Liang, X. and Wang, G., 2022. Knowledge Distillation via the Target-aware Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10915-10924). 
[22] – Yang, C., Wang, Y., Zhang, J., Zhang, H., Wei, Z., Lin, Z. and Yuille, A., 2022. Lite vision transformer with enhanced self-attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11998-12008). 
[23] – Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T. and Xie, S., 2022. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11976-11986). 
[24] – Guo, J., Han, K., Wu, H., Tang, Y., Chen, X., Wang, Y. and Xu, C., 2022. Cmt: Convolutional neural networks meet vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12175-12185).

27 Apr, 2022

Quant’s Place in the ESG World: Engagement at Quant Funds


We are in the middle of annual general meetings season, where companies present their annual reports and shareholders get to vote on key matters. Financial institutions have found themselves at the forefront of sustainability discussions, engaging with corporates and influencing their decision-making on these matters. With good governance, environmental and social topics becoming increasingly prominent around the world, it is in an investor’s interest to oversee how investee companies manage sustainability risks, for the benefit of their portfolios and for the benefit of the planet. Engagement with companies is the most effective way to effect change – treating sustainability as a real-world issue rather than merely a portfolio optimisation problem.


The challenge with systematic strategies is that the rules guiding investment decisions are vested in the investment model, and PMs do not take discretionary decisions. As such, you often cannot apply the traditional engagement tools available to investors, i.e.:

  • Filing a shareholder resolution to escalate matters
  • Voting against a company’s management during annual general meetings
  • Divesting, if engagement with the corporate fails

The reason for this challenge is that pursuing the above-mentioned approaches requires equity ownership – you need to be invested in the company. As a quantitative asset manager, you simply don’t know if you will be invested in the company at the time of its annual general meeting. Market conditions can change, and the company can be dropped from the portfolio as part of the model recalculation. Not to mention the turnover, which is precisely the reason why quant houses are sometimes discouraged from getting involved with engagement initiatives. 

How did we approach this?

Arabesque is built on two pillars: sustainability and AI. These two pillars form the foundation of our investment philosophy and are reflected in the investment process of all our strategies. Our shareholders place their trust in Arabesque to manage their money in line with these principles: by directing capital towards companies that are fit for the future, while utilising the latest AI technologies.

As a company with sustainability at the core of our mission, and as a manager with fiduciary duties, we believe that quantitative houses do have a role to play. We currently utilise the following approaches to integrate stewardship considerations:

  1. Proxy voting: Voting is an obvious place to start but is often dismissed as a box-ticking exercise. However, not all votes get cast, as many are lost in passive products – leaving shareholders without a say in a company’s decisions. The power of proxy battles should not be underestimated. For example, in 2021, investors voted out 3 of the 12 Exxon Mobil Board Directors on the grounds that they lacked the experience to lead the oil giant in the transition to a low-carbon economy. We therefore need the largest shareholders to act, and smaller shareholders to voice their concerns too. The fund behind the Exxon engagement campaign held only 0.02% of the company but was able to effect significant change in top management. At Arabesque, we cast votes for companies in all our funds, in line with our ESG Voting Policy. Our sustainability team monitors the votes on a daily basis and consults the Arabesque Sustainability Committee.
  2. Collaborative engagement: Investor engagement campaigns are likely to be more effective when supported by a larger number of financial institutions. Collaborative initiatives provide space for idea sharing and combined analysis, and foster an environment of financial institutions working towards the same goal. The obvious benefit of collaboration is the leverage you gain by combining the AUM of various asset managers and owners. Arabesque is currently part of various collaborative initiatives via ClimateAction100+, Share Action and the Investor Decarbonisation Initiative. We often don’t hold positions in the companies these initiatives engage with, but we lend our brand and provide data for further analysis.
  3. Our engagement campaign: Arabesque launched an engagement campaign focusing on the improvement of GHG emission disclosure in the European technology sector. With net-zero commitments being announced every week, we need data to assess them, as what cannot be measured cannot be managed. Given our expertise in sustainability data, we believe this is an area where we can play our part in the world of investor engagement initiatives. Data availability and quality are a key concern for the effective performance of the investment strategy and, in turn, a key concern for our shareholders who trust us with their money. The campaign was supported by investors with $970bn in AUM.
    Further details and the methodology behind our campaign will be explored in our next newsletter – stay tuned.

The verdict: do quants have a role to play?

The answer is absolutely. We are aware that our efforts are only part of a larger puzzle of actions required for the global just transition to a low-carbon economy. But the journey is long and complex, so we need all hands on deck to get going, including those of quant managers.

Will Arabesque be able to move the needle? Probably not – but we cannot afford to wait, as the risk of inaction is greater.

23 Mar, 2022

Systematic Credit: accessing another asset class


Systematic credit has attracted more attention over the years as electronic trading of various credit instruments has gained in volume and market share. The increased availability of high-quality data and the growth of liquidity have made it possible for us at Arabesque AI to consider an expansion into credit instruments.

Both the valuations of corporate credit and equity are dependent on the overall health of a company. Therefore, they share many performance drivers, which we have already implemented into our equity analytical models. This makes it an interesting use-case for us to investigate our models’ transfer-learning capabilities. Through various proofs-of-concept over the past year, we have demonstrated the ability to build analytical models for corporate credit bonds.

Naivety and Challenges

From a theoretical, machine-learning perspective, given the strong pipeline of models we have built for our equity predictions, the application to the credit universe looks like a simple problem. It can be solved by creating a new dataset and a new label, which we then use to train and evaluate a baseline model from our existing pipeline. However, the reality is a lot more challenging, due to 1) heterogeneity of data in clustering and across time, 2) high dataset imbalances, 3) trustworthiness of data and, last but not least, 4) entity and exchange mapping challenges. Let’s briefly look at each of these challenges.

Heterogeneity of data: When we consider equities, we can naively group them by the geographies they trade in, the sectors they conduct business in, etc. Ultimately, most of these instruments are non-preferential shares, otherwise known as common shares, and hence comparable in a way. Corporate credit, by contrast, is awash with small details that make bonds hard to compare. Some bonds are callable or puttable: the former gives the issuer the right to redeem the bond before the maturity date, while the latter gives the holder the right to demand repayment of the principal before the maturity date. In equities, options are separate financial products and therefore do not need to be considered in pure stock-price forecasting. Further, the maturity dates of bonds are not aligned; one company can issue various types of bonds, such as secured bonds or convertibles; and, to make it even more complicated, some European bonds are eligible for the ECB’s asset purchase programme. Hence, grouping “similar assets” for training is a harder task in bonds if one wishes to adjust for all these granularities.

To make matters worse, equities can almost always be assumed to exist perpetually, barring corporate events. Bonds, on the other hand, can almost always be assumed to expire at some point, except in the occasional case of perpetual bonds. This means the universe refresh rate is exceedingly high, which presents many challenges for machine learning algorithms, including inconsistent dataset sizes and the unknown extent of survivorship bias versus maturity effects. Datasets therefore need to be asset-agnostic to a certain degree and carefully constructed to maintain comparability.

High dataset imbalance: In equities we can frame the problem either as a price prediction or as a returns prediction, either of which can be calculated from the prices of the equities (split/dividend-adjusted, which are still just intrinsic datapoints of the equities). In bonds, we can frame the problem either as a price prediction or as a credit spread prediction. The former is a bond datapoint; the latter combines the bond yield with the risk-free rate, typically a US Treasury bond. Here, we are implicitly predicting “interactions” between two different assets: the bond and the risk-free rate. Moreover, when we train on a target label of a minimum spread widening/narrowing, we find stark class imbalances, more pronounced than in the equivalent equity setting of minimum return requirements. The imbalance often calls for reweighting the loss function, since for trading-cost reasons we value one class over the other; for example, going long a bond is much easier than shorting one, compared to the equities world.
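To make the reweighting idea concrete, here is a minimal sketch of a class-weighted cross-entropy loss. The labels, probabilities and weights are purely illustrative, not our production setup; up-weighting the rare class is one common remedy among several (resampling is another).

```python
import math

def weighted_log_loss(y_true, p_pred, class_weights):
    """Cross-entropy where each example is weighted by its class.

    Up-weighting the rare class penalises the model more for missing
    the class we would actually want to trade on.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        w = class_weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Toy imbalanced labels: 1 = spread narrowing (rare), 0 = widening (common).
y = [0, 0, 0, 0, 1]
p = [0.2, 0.1, 0.3, 0.2, 0.4]   # model's predicted probability of class 1

unweighted = weighted_log_loss(y, p, {0: 1.0, 1: 1.0})
# Weight the rare class by the inverse class frequency (4:1 here).
weighted = weighted_log_loss(y, p, {0: 1.0, 1: 4.0})
```

Because the single rare-class example is poorly predicted here, the reweighted loss is substantially larger, which is exactly the pressure we want the optimiser to feel.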

Trustworthiness of data: The challenges above are compounded by the deteriorating quality of data for bonds from lesser-known issuing entities or with lower credit ratings. In a trading landscape where OTC trading still contributes a significant share of liquidity, bid/ask data and volumes recorded from electronic markets are sometimes misleading and, worse, untradeable. This influences not only the training of the models but also the executability of credit trading signals; often, it means sanity-checking the data manually. The trustworthiness of data also feeds back into the design of trading decision horizons and therefore of the target labels for the credit model.

Mapping of entities: Many commercial data providers carry their own asset-mapping IDs. Since bonds are issued by firms that, most of the time, have also issued their own shares, we have an incentive to link equity IDs to bond IDs. The mapping is important for understanding where the bonds lie in the capital structure and what credit risks they bear. This is less of a problem when one sources data from a single provider, but it quickly becomes a tedious task when mapping across databases.
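A toy sketch of such an issuer-level join, assuming both providers expose some shared issuer identifier (an LEI here). The record layouts, IDs and the `link_bonds_to_equities` helper are hypothetical; in practice no single identifier covers every instrument, which is where the manual work begins.

```python
# Hypothetical records from two providers, each with its own instrument IDs,
# but both carrying an issuer-level identifier (an LEI) we can pivot on.
bonds = [
    {"bond_id": "A-001", "issuer_lei": "LEI123", "coupon": 4.25},
    {"bond_id": "A-002", "issuer_lei": "LEI123", "coupon": 2.50},
    {"bond_id": "A-003", "issuer_lei": "LEI999", "coupon": 6.00},
]
equities = [
    {"equity_id": "B-777", "issuer_lei": "LEI123", "ticker": "ACME"},
]

def link_bonds_to_equities(bonds, equities):
    """Map each bond to the issuing firm's equity record, if one exists."""
    by_lei = {e["issuer_lei"]: e for e in equities}
    linked, unmatched = {}, []
    for b in bonds:
        eq = by_lei.get(b["issuer_lei"])
        if eq is None:
            unmatched.append(b["bond_id"])  # falls through to manual mapping
        else:
            linked[b["bond_id"]] = eq["equity_id"]
    return linked, unmatched

linked, unmatched = link_bonds_to_equities(bonds, equities)
```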

Measuring the quality of a systematic credit model

For any system, there must be a way to conduct quality checks. For machine learning systems, we can rely on metrics such as accuracy, error sizes, F1 scores, etc. However, these might not be sufficient for models that produce forecasts for more illiquid holdings. Over longer holding periods, it is important to understand the models from a fundamental analyst’s perspective. This means 1) understanding the behaviour of different machine learning systems and algorithms, 2) understanding the contribution and importance of different input features, and 3) understanding the variability of model outputs.

Model response to datasets: We know that different algorithms respond differently to the same dataset. Training an ARMA model will yield different outcomes than training a Gaussian process model. Therefore, we need to monitor the out-of-sample predictive power of each model on the same dataset. Given known issues with input data and potential clustering of erroneous data, it is also important to understand how the algorithms respond to corrupted data in various segments of the datasets, i.e., their response to adversarial attacks. As different models have different data requirements, e.g., i.i.d. variables for some statistical models and large enough datasets for neural nets, we also investigate the models’ performance for varying dataset sizes. However, this sometimes results in overgeneralising and glossing over key differentiating features of bonds. Understanding these aspects is key to choosing models, given the aforementioned challenges of persistently wide datasets in the credit space.

Feature importance: As we vary the models, the large number of data points we feed into them makes it hard to tell which features really contain information and which are simply noise. We can select features by exhaustively searching through perturbations of features to identify gains in, e.g., accuracy, but this is extremely computationally expensive and only works for one instance of the {model, dataset} pair, when we may have multiple datasets across years and clusters. We can map feature importance easily for an XGBoost model through LIME/SHAP algorithms, but these are not necessarily applicable to the other models; the same goes for statistical tests on model coefficients. A pragmatic alternative is to combine a leave-one-out algorithm with a blanket blackbox model representing the entire system of models, mapping from a subset of features to our produced signals.
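The leave-one-out idea can be sketched as follows: neutralise one feature at a time and measure the accuracy drop against the full model. The toy blackbox and data are purely illustrative (a real run would wrap the whole system of models), and replacing a feature with a baseline value is one of several possible ablation schemes.

```python
def loo_importance(model_fn, X, y, baseline_value=0.0):
    """Leave-one-out importance for a blackbox classifier.

    Re-scores the dataset with each feature set to a baseline value and
    reports the accuracy drop; a bigger drop means a more important feature.
    """
    def accuracy(rows):
        preds = [model_fn(row) for row in rows]
        return sum(p == t for p, t in zip(preds, y)) / len(y)

    full = accuracy(X)
    drops = {}
    for j in range(len(X[0])):
        ablated = [row[:j] + [baseline_value] + row[j + 1:] for row in X]
        drops[j] = full - accuracy(ablated)
    return drops

# Toy blackbox that only actually uses feature 0: predicts 1 iff x0 > 0.
model = lambda row: int(row[0] > 0)
X = [[1.0, 5.0], [-1.0, 5.0], [2.0, -3.0], [-2.0, -3.0]]
y = [1, 0, 1, 0]

drops = loo_importance(model, X, y)
# Ablating feature 0 hurts accuracy; ablating feature 1 changes nothing.
```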

Variability of model outputs: Models produce signals that can change as fickly as I change my mind when choosing flavours in an ice cream parlour. A common way to deal with this is to smooth signals over time through moving averages. For systematic credit strategies, however, we need to understand the fickle signals intuitively: if we smooth the signals, surely that means we cannot be that confident in the models’ decisions? To deal with volatile signals, we can measure the uncertainty of predictions via inductive conformal prediction, which also neatly avoids the need to constantly retrain models.
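As a sketch of how inductive conformal prediction can wrap an existing point forecast: hold out a calibration set, collect the absolute residuals, and use their quantile to build an interval. The residuals and the `conformal_interval` helper below are illustrative, not our production implementation.

```python
import math

def conformal_interval(calib_errors, y_hat, alpha=0.1):
    """Inductive conformal interval around a point forecast y_hat.

    calib_errors are |y - y_hat| residuals on a held-out calibration set;
    the (1 - alpha) conformal quantile of those errors gives an interval
    with roughly 1 - alpha coverage, with no model retraining required.
    """
    n = len(calib_errors)
    scores = sorted(calib_errors)
    # Standard conformal rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    q = scores[k - 1]
    return y_hat - q, y_hat + q

# Hypothetical absolute residuals from a calibration split.
calib = [0.1, 0.3, 0.2, 0.5, 0.4, 0.25, 0.15, 0.35, 0.45, 0.05]
lo, hi = conformal_interval(calib, y_hat=1.0, alpha=0.2)
```

A wide interval around a fickle signal tells us directly how little the model trusts its own forecast; demanding higher coverage (smaller alpha) widens the interval accordingly.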

About Arabesque AI

Arabesque AI was founded in 2019 as part of the Arabesque Group. We developed our proprietary Artificial Intelligence Engine to forecast stock prices globally on a daily basis. On top of our AI, we built and launched AutoCIO, a platform that creates bespoke investment strategies. Using AI and Big Data, AutoCIO offers hyper-customization, enabling investors to align their investment and sustainability criteria. At Arabesque, AI is not only a buzzword. We advise over $450mn on our platform, proving that AI is ready to be used in practice.

3 Feb, 2022

TCFD Alignment Barometer: Measuring Climate Disclosure


The Task Force on Climate-related Financial Disclosures (TCFD) was established in 2015 by the Financial Stability Board to develop recommendations for more effective climate-related disclosures. In 2017, the TCFD published a set of recommendations to guide companies in providing better climate-related reporting, which has since become the global standard for climate disclosures. The TCFD Alignment Barometer, delivered through ESG Book, supports corporates and investors in understanding the TCFD recommendations and the reporting landscape.

To read the full article, click here.

28 Jan, 2022

Cows on the Beach: How AI can work with the unexpected


Our aim at Arabesque AI is to create customisable, actively managed, AI-powered investment portfolios. To do so, our AI must be able to undertake any mandate rather than specialise in one area. This generalisability is a core problem in AI research and has many dimensions: it can represent time, geography, or context. For example, in the time dimension you need your models to keep working from the pre-pandemic to the post-pandemic world (and during it, too). Geographically, you need to achieve performance anywhere from Peru to Russia; contextually, you may need to apply to both start-ups and established companies. A generalisable AI can adapt to the environment it is in.

Our AutoCIO platform covers 25,000 stocks across 60 countries and over 80 exchanges, with data spanning back in time across many market regimes. AutoCIO allows you to create hyper-customised strategies and explore potentially millions of configurations depending on your investment aims. Creating such generalisation is a difficult task, even more so with financial data, so what are some of the obstacles modellers face?

Covariate Shift

One problem which makes generalisation difficult is covariate shift. This is when, although the relationship between inputs and outputs remains the same, the distribution of your inputs changes. Imagine training a facial recognition device on spotless photos of you, perhaps headshots or dating profile pictures. When we attempt to generalise to your face first thing in the morning, pre-coffee, it might fail. Your face has not changed, but the context has, and so previously learned relationships may not hold.

Another example, given in this paper by researchers at Facebook, concerns the classification of images of cows and camels. In your training set the animals appear in their natural environments: cows in fields and camels in deserts. When we attempt to generalise to unnatural habitats, perhaps cows on beaches or camels in the park, we fail spectacularly.

Cows on a beach may seem far-fetched, but for those dealing with real-world data, far stranger things occur. You need to be able to spot opportunities or risks however they are presented. Despite training across as wide an array of environments as possible, it is infeasible to enumerate all the environments that will be encountered. The usual mantra of ‘more data’ does not necessarily help, as there are no guarantees the sample will contain the relevant information. Upon entering a new environment never seen before, there are fundamental limits to letting the ‘data talk’.

Cows on a beach – Taken by an (intelligent) Arabesque AI researcher, Sri Lanka 2017

Changing Relationships

Covariate shift describes the situation where the context changes but the input-output relationship does not. A more intractable problem is when the input-output relationship itself changes, as is often the case with financial time series. In facial recognition this is the case of ageing, injury or plastic surgery: your mother may recognise you, but a naïve machine may not. For cows and camels this is perhaps the new breeds of super cows which more resemble Arnold Schwarzenegger than a typical Aberdeen Angus, recognisable to a farmer but not necessarily to your classifier. The rules which you previously learned, whatever the context, may no longer hold.

In finance, rules change as markets move through volatility regimes, rate cycles, expansions, moderations, and contractions. However powerful your method is, the data you observe may be just the remnants of forgotten relationships. A seasoned trader who has survived for decades may recognise a market shift better than a machine trained only on recent data. This is the aim for AI also.

A New Approach

A pervasive problem in AI scenarios is a lack of understanding of your input data and its context. In the cow and camel example, if, say, 90% of all pictures were taken in natural habitats, your machine may minimise error by simply classifying anything in a green landscape as a cow and anything in a brown landscape as a camel, achieving 90% accuracy. The problem is that you have not uncovered the causal mechanism that makes a cow a cow; you have identified a spurious correlation that will not generalise. Instead, we want to find robust patterns.

These problems are leading AI researchers to rediscover the power of structured models, driven by domain-specific knowledge. Understanding your data features, their context, structure and meaning can lead to more applicable results. The paper containing the cow/camel example shows one such approach, which its authors call Invariant Risk Minimisation (IRM). Here the data is split into two environments, natural (N) and unnatural (U). Instead of a model (like the cow-camel example above) that performs with 100% success in N and 0% in U (for a total of 90%), we encourage the learning of relationships that are stable across environments. We may end up with an accuracy of only 75% in each of N and U, but we have forced the model to focus on the immutable characteristics that make a cow a cow wherever it is; invariance to the environment signals that you have found a true causal relationship.
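The idea can be sketched with a simplified invariance penalty: pooled risk plus a penalty on how unevenly the per-environment risks spread. This is in the spirit of IRM but is not the paper's exact IRMv1 formulation, and the two-feature toy data is an invented stand-in for the cow/camel setting (feature `x1` is a noisy but stable signal, `x2` a "green landscape" shortcut that only works in one environment).

```python
def risk(w, data):
    """Mean squared error of the linear predictor y_hat = w[0]*x1 + w[1]*x2."""
    return sum((w[0] * x1 + w[1] * x2 - y) ** 2 for x1, x2, y in data) / len(data)

def penalised(w, envs, lam):
    """Pooled risk plus lam times the variance of per-environment risks --
    a simplified invariance penalty, not the exact IRMv1 objective."""
    risks = [risk(w, d) for d in envs]
    mean = sum(risks) / len(risks)
    var = sum((r - mean) ** 2 for r in risks) / len(risks)
    return mean + lam * var

# Env A: the shortcut x2 equals y exactly; Env B: x2 carries no signal.
# x1 is equally noisy in both environments -- the invariant feature.
env_a = [(1.8, 1, 1), (0.2, 1, 1), (-1.8, -1, -1), (-0.2, -1, -1)]
env_b = [(1.8, 0, 1), (0.2, 0, 1), (-1.8, 0, -1), (-0.2, 0, -1)]
envs = [env_a, env_b]

invariant_w = (1.0, 0.0)  # relies on the stable feature
spurious_w = (0.0, 1.0)   # relies on the environment-specific shortcut

pooled_inv = penalised(invariant_w, envs, lam=0.0)
pooled_spur = penalised(spurious_w, envs, lam=0.0)   # shortcut wins on pooled risk
pen_inv = penalised(invariant_w, envs, lam=2.0)
pen_spur = penalised(spurious_w, envs, lam=2.0)      # but loses once stability matters
```

On pooled risk alone the shortcut looks better, exactly like the 90%-accurate landscape classifier; once risks must be stable across environments, the invariant predictor wins.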

At Arabesque AI we continually research and update our models, and when we do, we want to see improvements across all environments, meaning it is more likely we have found true relationships instead of spurious correlations. As a closing example, consider another aspect of our business: sustainability. A company that has achieved great success through unsustainable (or simply corrupt) practices is unlikely to continue its in-sample performance out of sample. This applies even more as the rules of business change: behaviour that was profitable before may no longer work. Diverse, sustainable, and well-governed companies are better placed to survive across market scenarios; they are more invariant to the environment. Therefore, at Arabesque we believe that carefully constructed AI and sustainability will help us navigate the environments which lie ahead.

13 Dec, 2021

The Anatomy of Technology-Driven Climate Investing


The COP26 summit is behind us, and although the agreements made by the 196 countries nudged the world closer to a net-zero pathway, there is still a mountain to climb. The Glasgow Climate Pact calls on governments to “Accelerate the development, deployment and dissemination of technologies, and the adoption of policies, to transition towards low-emission energy system”, including “accelerating efforts towards the phasedown of unabated coal power and phase-out of inefficient fossil fuel subsidies”. Greenhouse Gas (GHG) emissions need to fall by 45% compared with 2010 levels by 2030 if the world is to stay on track to reach net-zero by around mid-century. The current trajectory, however, is estimated to be 13.7% above the 2010 level in 2030. The challenge is stark.

This article will outline how to build robust and effective climate pathway strategies using (imperfect) ESG data, analytics and technology, generating active market returns in our collective race to net zero. Quite simply, there is no time to wait. 

To read the full article, click here.

23 Nov, 2021

ESG data as a public good?


The dissemination of sustainability data comes at (not insignificant) cost to the data preparers (usually the reporting companies themselves), investors, NGOs, universities, and other stakeholders such as regulators and policymakers. Given the relatively young and opaque nature of sustainability reporting, it is not surprising that collecting, systematising and analysing environmental, social and governance (ESG) data involves considerable time and expense.

With spending on ESG data continuing to grow at an annual rate of 20%, it is expected to reach USD 1 billion by the end of 2021 [1]. This, in turn, has opened up arbitrage opportunities for incumbents and market participants looking to profit from the limited transparency and high barriers to entry associated with the provision of ESG data.

As reliance on ESG data grows, significant opportunities arise to lower the cost of accessing it, both for reporting companies and for the end users of the data. Given the increasing attention, further integration of sustainability metrics into financial markets would benefit the quality and impact of sustainability information. This blog explores the idea of ESG data as freely accessible information [2].

Using and reporting ESG data is (for now) a costly affair

In a study conducted by the European Commission on the expected costs of complying with the new EU Corporate Sustainability Reporting Directive (CSRD), policymakers expect the annual reporting costs for the 49,000 European companies within the scope of the directive to amount to no less than EUR 3.6 billion, of which EUR 1.2 billion are one-off implementation costs. 

These rising data and compliance expenses will disproportionately hit smaller companies, which are not as well equipped with sophisticated corporate social responsibility (CSR) or sustainability departments as their larger competitors. 

On the other side, using ESG data is becoming ever more resource-intensive as ESG grows in relevance and more companies and other market participants disclose their data. The need to take ESG into account has intensified in recent years for both reporting companies and end users: ESG is truly entering the mainstream. Yet the cost of accessing ESG data has not fallen as quickly as demand for it has risen. 

Opening up ESG data

What if ESG data became a public good? More easily accessible ESG data can serve market participants’ need for better-informed, sustainability-oriented investment decisions. At the same time, ESG information as a generally available public good can help individuals and consumers make the right decisions about the products and services they buy and invest in every day. 

Of course, not everyone is interested in (or able to) track sustainability information for their investments or purchases. But as with the standard financial information of listed companies, easy access to standardised corporate data makes a big difference. Newspapers and other media facilitate informational efficiency in financial markets and the economy partly because the financial data of listed companies is treated as a public good.

In practice, the right technology and efficiency-enhancing tools can lower the marginal cost of standardised ESG data reporting and access. While such a transition is unlikely to happen overnight, enabling stakeholders to streamline their sustainability disclosures, centralise reporting and optimise for standardised data access can yield significant economies of scale. Capacity building and technology will be at the heart of a better-functioning sustainability data landscape, one that also addresses concerns about “greenwashing” and the misrepresentation of a company’s sustainability performance. 

Sustainability information could fall under the label of a “digital global commons”. This means that this type of data is so important to achieving our global sustainable development agenda that it should be made freely available and accessible to the public and the contributing community, similar to the concept of cyberspace being available and usable by all. Think Wikipedia, but for ESG data. 

In the words of Mayo Fuster Morell, digital global commons tend to be “non-exclusive, that is, (generally freely) available to third parties. Thus, they are oriented to favour use and reuse, rather than to exchange as a commodity. Additionally, the community of people building them can intervene in the governing of their interaction processes and of their shared resources”. [3]

Furthermore, in October 2020 the European Commission adopted its new open source software strategy 2020-2023. The strategy’s main goal is the possibility of achieving Europe-wide digital sovereignty, enabling Europe to maintain its digital autonomy and to drive innovation, creativity and breakthrough technological advances. 

The benefits of publicly accessible sustainability data

Can we regard sustainability data as a digital global commons? Considering the range of viewpoints and impacts involved, there is a strong case that basic access to ESG data and reporting should be more readily available to everyone, from the world’s largest listed companies to the smallest family businesses. Such a level of transparency and freedom of information and disclosure would, in turn, enable capital to flow more efficiently into companies that genuinely meet the criteria for sustainable investing. 

ESG data has its limits, and greenwashing is a real problem. Some sustainability efforts fall under the umbrella of marketing, designed to placate consumers and investors who are willing and able to pay for an ESG “label”. Detailed and bespoke ESG analysis can become highly complex. Access to ESG data cannot be free in all circumstances, but lowering the barriers to reporting and accessing ESG information would add to our collective understanding of how sustainability issues can affect our economies and long-term investment decisions. A better feedback loop between providers of sustainability data (or reporting entities) and end users would help tackle some of the most pressing challenges in ESG data integration. It is an ongoing process of gathering, correcting and updating information. Technology-enabled transparency can highlight the organisations that genuinely take ESG seriously, and make them role models for others.

[1] How to Combat Greenwashing? Find the Right Data Partner  

[2] These figures are in addition to the disclosure costs associated with the EU Taxonomy, amounting to EUR 1.2–3.7 billion in one-off costs and EUR 600–1,500 million in recurring costs per year. Proposal for a DIRECTIVE OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL amending Directive 2013/34/EU, Directive 2004/109/EC, Directive 2006/43/EC and Regulation (EU) No 537/2014, as regards corporate sustainability reporting. 

[3] Fuster Morell, M. (2010, p. 5). Dissertation: Governance of online creation communities: Provision of infrastructure for the building of digital commons.

1 Nov, 2021

The proof is in the textual pudding: finding financial signal in language data


Word soup: working with unstructured data 

It is easy to assume that there must be some information in text data which is relevant to market prices. For example, a company’s share price could be negatively affected after a scandal or lawsuit (such as Facebook after Cambridge Analytica), or the price of a commodity could change as a result of environmental and geopolitical events (such as the recently soaring gas prices following last year’s cold winter and increased demand from Asia). Data sources such as news, social media and companies’ internal communications have been shown to have some level of correlation with the market. However, extracting such information is far from easy. A first issue is that this type of data, unlike traditional financial data, is what is usually called ‘unstructured’ data: it does not come in a readily usable numerical or tabular format, i.e., nicely organised rows and columns with headers as you would find in an Excel sheet. On top of this, computers only understand numbers! 

Why are gas prices so high and what is happening to fuel bills? - BBC News

Image Source: https://www.bbc.co.uk/news/business-58090533  

Because of this, we need to transform the data in several ways to obtain a format that can be used in downstream tasks, such as predicting how the price of a financial instrument is likely to evolve over time. In general, the steps follow this pattern: 

  • Gathering the data. For example, scraping the web, using a news data provider, or querying a social media platform’s API (its programmatic interface). 
  • Cleaning, or pre-processing, the data. Depending on the type of raw data, the application, and the model used, there can be as little or as much of this as required. Examples of pre-processing include lowercasing and removing stop words, i.e., words with little meaning content, such as ‘be’, ‘the’, ‘of’, etc. 

When the data is scraped from web pages, there can be a lot more work! Think of looking for meaningful content in a pure-text version of an article from your favourite news page, without the graphics to help you distinguish between headers, timestamps, social media links, references, etc. 

  • Turning the data into a numerical representation. There are many ways of doing this, with state-of-the-art methods involving models pre-trained on very large amounts of text, which will ‘ingest’ your text and spit out lists of numbers. The choices you make are still very important: Which model? Trained for what task? How do you split your data? How do you combine the results? 
  • Building a model, which will use your text data to perform some task. 
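The cleaning and numerical-representation steps above can be sketched in a few lines. The stop-word list and the simple bag-of-words count vectors are deliberately tiny stand-ins for what a real pipeline (or a large pre-trained model) would use.

```python
# A tiny, illustrative stop-word list; real ones contain hundreds of entries.
STOP_WORDS = {"be", "the", "of", "is", "a", "to", "and", "are"}

def preprocess(text):
    """Lowercase, strip punctuation, and drop stop words."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return [t for t in cleaned.split() if t not in STOP_WORDS]

def vectorise(tokens, vocabulary):
    """Bag-of-words: one count per vocabulary entry, in a fixed order."""
    return [tokens.count(word) for word in vocabulary]

docs = ["Gas prices are soaring.", "The price of gas is high!"]
cleaned = [preprocess(d) for d in docs]
vocab = sorted({t for doc in cleaned for t in doc})
vectors = [vectorise(doc, vocab) for doc in cleaned]
```

Note how even this toy example surfaces a real problem: ‘price’ and ‘prices’ end up as unrelated dimensions, which is exactly the kind of gap that learned numerical representations close.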

Let’s have a look at two tasks using text data which are highly relevant to financial applications.  

Types vs. tokens: Identifying entities 

A first important task when processing textual data is identifying entities, e.g., people such as ‘Steve Jobs’ or organisations such as ‘Microsoft’. This is called Named Entity Recognition (NER). The way to do this is by gathering relevant data and labelling many instances of it, e.g., labelling ‘Steve Jobs’ as PERSON and ‘Microsoft’ as ORGANISATION, then training a model which learns which types of words are likely to be instances of each type. 

A subsequent step, which can be performed after this, is to link the entity to a real-world referent through the use of a database. For example, once I have identified that ‘Steve Jobs’ is an entity of the type PERSON, I can go into a database such as Wikipedia to try and find whether there is a PERSON entry with this name. If there were more than one (as would be the case with, e.g., Amazon: the forest and the company), I would have to use the context of the sentence along with each database entry to disambiguate between them by figuring out which is more likely. In the case of Amazon, if the original sentence mentions trees and there is a Wikipedia entry which mentions them too, their numerical representations are likely to be closer, which means Amazon in such a context would get linked to the ‘forest’ entry. 
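A toy version of this disambiguation step, using raw word overlap in place of learned numerical representations; the candidate entries and the `disambiguate` helper are made up for illustration.

```python
# Hypothetical database entries for the two 'Amazon' candidates.
CANDIDATES = {
    "Amazon (forest)": "tropical rainforest trees river basin south america",
    "Amazon (company)": "technology company e-commerce cloud retail seattle",
}

def disambiguate(sentence, candidates):
    """Pick the candidate entry sharing the most words with the sentence's context."""
    context = set(sentence.lower().split())
    def overlap(entry):
        return len(context & set(candidates[entry].split()))
    return max(candidates, key=overlap)

best = disambiguate("Deforestation threatens trees across the Amazon basin", CANDIDATES)
```

Here the mentions of ‘trees’ and ‘basin’ pull the link towards the forest entry, exactly as the dense-representation version would, just with far blunter arithmetic.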

This is a case where some types of pre-processing should not be performed because they could potentially be harmful to recognition. For example, given that upper and lower casing are useful cues, maybe we should avoid lowercasing, e.g., avoid turning ‘Apple’ into ‘apple’.  

Image Sources: https://www.popularmechanics.com/science/environment/a28910396/amazon-rainforest-importance/ and https://www.ledgerinsights.com/amazon-job-ad-digital-currency-lead-for-payment-acceptance/  

In practice, modern systems are now fairly robust to these changes. However, it is still a good habit to keep in mind the type of data we are using and make sure it is processed in a tailored way. 

Consider how some types of text, such as news headlines, might confuse systems which rely on casing for entity recognition, e.g.: 

“Viacom18 Plans to Adapt More Paramount Titles for India, Sets New Release Date for Aamir Khan’s ‘Forrest Gump’ Adaptation.” 

As a human, how do you identify which words are named entities in this sentence? How do we get a computer to do this? These are the type of questions we need to ask ourselves as we try and figure out ways to process the very complex type of data that is language data.  

Assessing feelings about assets and events 

Another useful task making use of our numerical representations of text is sentiment analysis: trying to determine whether a piece of text expresses a positive or negative emotion. Traditionally, this has been achieved in financial contexts by using specialised dictionaries (lists of words with corresponding positive or negative scores). However, this is a difficult task to achieve using only lists of words, as sentiment goes beyond the meaning of individual words (consider negations such as ‘This is not a great film’, or sarcastic sentences such as ‘This was a really great film. I absolutely did not fall asleep halfway through’). 
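A sketch of the dictionary approach and exactly where negation breaks it. The word scores are invented for illustration, and the ‘not’-flipping patch is deliberately crude: it handles the example below but would still miss sarcasm and longer-range negation.

```python
# Invented word-level scores, standing in for a specialised sentiment dictionary.
SENTIMENT_SCORES = {"great": 1.0, "good": 0.5, "awful": -1.0, "boring": -0.5}

def naive_sentiment(text):
    """Sum word-level scores -- no notion of negation or sarcasm."""
    return sum(SENTIMENT_SCORES.get(w, 0.0) for w in text.lower().split())

def negation_aware_sentiment(text):
    """Flip the score of any word that directly follows 'not'."""
    words = text.lower().split()
    score = 0.0
    for i, w in enumerate(words):
        s = SENTIMENT_SCORES.get(w, 0.0)
        if i > 0 and words[i - 1] == "not":
            s = -s  # crude patch: only catches adjacent negation
        score += s
    return score

review = "this is not great"
# The naive scorer reads this as positive; the patched one flips it.
```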

The usual way to perform this task is once again to gather relevant data where pieces of text have been labelled as ‘negative’ or ‘positive’ (for example, movie reviews) and train a model to recognise which types of sentences are associated with which emotions. The data used when you are teaching your model should ideally be as close as possible to the data you will be using the model with. For example, if you learn to identify sentiment in movie reviews, you may not perform as well with financial news, as these are fairly different types of text, with different underlying principles and different author populations.  

Another thing to keep in mind is that ‘sentiment analysis’ generally means analysing the overall sentiment expressed in a piece of text. It does not necessarily say anything about the valence (emotional score) of an event. For example, the event of a zoo closing may be positive from the point of view of an author worried about animal rights, but negative from the point of view of the zoo staff. Similarly, the event of a hostile takeover, such as that of Cadbury by Kraft Foods in 2009, may carry positive emotions for the latter and negative emotions for the former, but a piece of text will only have one sentiment score.  

Even with state-of-the-art models, performance on this task is still far from perfect: they still struggle with negation and sarcasm, as well as displaying worrying consistency issues and bias, e.g., predicted sentiment changing when names or locations in the text are swapped, which should not happen. This is partially due to the way these models are trained. Left to figure out which language predicts sentiment without much guidance, they may pick up on correlations which we would consider meaningless, e.g., if every text mentioning Brazil that the model received happened to be negative, Brazil will bias the prediction towards negative sentiment. 

Aligning causes and effects 

Finally, suppose you have performed the previous two tasks, i.e., you have identified which assets are referred to in your text data, and which sentiment is expressed each time. How do you use this in relation to the market? How do you know whether this sentiment will impact prices, and when?  

This is probably the trickiest part. While sentiment extracted from news and other similar sources does correlate to some extent with the market, this has typically been demonstrated by aggregating over many years of data, using high-quality curated data, or manually selecting relevant news. What is more, several effects make the task even harder. First, it has been shown that the market sometimes starts to react to a piece of news up to two weeks before it is published (as the content of announcements is often common knowledge before they are made official). Second, different types of news and media content affect the market with different lags and timelines; for example, the market does not react in the same way to positive and negative events. Lastly, most of the text data gathered is likely to be irrelevant and have no impact whatsoever. This makes it very difficult to train a model, as it has to work with a very low signal-to-noise ratio, just as is the case for the price data! In fact, layering noisy language data on top of already-noisy financial data is what makes this such a difficult problem to solve.   

18 Oct, 2021

ESG, News and the power of Natural Language Processing


Social media and online news have fundamentally changed the way people interact with companies. Posts on platforms like Twitter or LinkedIn, along with blogs and online news articles, provide accounts of stakeholder experiences with companies and of their perception of corporate behaviour, and allow these views to spread rapidly. This in turn shapes stakeholder perspectives and informs stakeholder actions. As such, social media and online news quickly mirror and shape corporate reputation, societal legitimacy, social license to operate, and stakeholder trust [1]. An illustrative example is H&M’s “trashgate” scandal, in which store personnel were found to be damaging and dumping unsold clothes in the garbage instead of donating them. Starting with an article in the New York Times, public outrage quickly spread across social media [2], where it was among the top three trending topics on Twitter and remained so for several days. Only after the outrage did H&M address the issue; investigations revealed that the particular New York store had been violating the company’s policy of donating unsold clothing to charity. “Trashgate” was one of the first examples to show how social media could push issues into news coverage, affect corporate publicity, and force companies to change their actions [3]. It also illustrates how social monitoring can be a powerful asset in the world of sustainability, especially for evaluating Environmental, Social and Governance (“ESG”) risk.

Over recent years, the use of ESG data and analytics has boomed in capital markets [4]. Real-time news and social media data are receiving increasing attention in cutting-edge decision-making strategies. This popularity is grounded in the ability of ESG data to provide insights that are absent from typical financial data. Traditional financial information has limited usefulness to investors today: it is backwards-looking and encompasses only a narrow financial base, and is therefore insufficient on its own to assess a company’s capacity for future profit. For example, financial data did not indicate potential unethical behaviour by H&M, and it only picked up on the resulting reputational (and financial) damage once it had already happened. Therefore, both retail and institutional investors increasingly focus on ESG factors to assess companies. This is supported by ESG research showing a positive relationship between a firm’s profitability and its ESG metrics [5] and illustrating that ESG data can help reduce portfolio risk [6].

However, ESG data in mainstream investing faces three main challenges: most ESG data is qualitative; the landscape of corporate disclosures is incomplete and inconsistent; and disclosures are generally voluntary, with sparse available data [7][8]. Many pertinent issues do not manifest in disclosures or regulatory filings and, when they do, the delays caused by reporting and publication cycles can leave relevant data out of date by the time it reaches the public domain. There is also a significant bottleneck in assessing ESG performance due to the manual effort of continuously sourcing and validating disclosure data. This bottleneck is even more prominent when dealing with large volumes of unstructured text data, such as social media or news. As demand for ESG increases, the need for accurate and near real-time responses to ESG issues becomes clear, and the ability to detect and represent such issues through data sources beyond a company’s filings is paramount. In the ever-changing investment landscape, the use of news and social media data has become critical to ESG investment strategies.

To properly realise the potential of news data, millions of articles need to be processed daily, and one must look towards the power and capability of Machine Learning (“ML”). The latest advances in Natural Language Processing (“NLP”) strengthen our ability to process unstructured text data. Moving away from the pre-determined text and keyword ontologies of the past [9], advances in deep learning have pushed the state of the art towards Transformer-based architectures such as BERT [10]. The key advantage is the use of context in decision-making. Language is complex: homographs, for example, are words whose meaning is entirely dependent on context. Without contextual understanding, false positives are likely, and many prominent classical methods are known to fall into this trap. Such approaches focus on words and the frequency of their occurrence, weighting words by how often they appear; if a corpus of articles frequently mentions the word ‘exploitation’, such techniques can systematically discount its relevance. Similarly, recognising the difference between ‘carbon’ in the context of greenhouse gas emissions and ‘carbon’ in a discussion of carbon allotropes is critical to understanding the text in question. In other words, “context is king”.
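The frequency-based discounting described above can be seen in the inverse-document-frequency (IDF) weight used by classical bag-of-words methods. In this toy corpus (invented for illustration), a domain-critical word that appears in every document receives a weight of zero, regardless of how important it is to the subject matter:

```python
import math

# Tiny illustrative corpus: 'exploitation' appears in every document.
docs = [
    "report on labour exploitation in the supply chain",
    "exploitation concerns raised by auditors",
    "investors respond to exploitation allegations",
]

def idf(term, corpus):
    # Classical IDF: log(N / document frequency); assumes the term occurs
    # in at least one document.
    df = sum(term in doc.split() for doc in corpus)
    return math.log(len(corpus) / df)

print(idf("exploitation", docs))  # 0.0 — appears everywhere, fully discounted
print(idf("auditors", docs))      # log(3) ≈ 1.10 — rarer, so weighted higher
```

A context-aware model, by contrast, judges each occurrence of a word by its surroundings rather than by corpus-wide frequency alone.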

Across the investment community, researchers and engineers are using machine learning in new and disruptive ways: analysing linguistic information from content, using ESG and sentiment data to determine a company’s commitment to ESG, and evaluating the impact of this commitment on stakeholders [11]. Sokolov et al. [12] show how BERT can be used as a classifier to aid ESG scoring, with aggregation approaches applied to the output to construct a score. Such scores allow investors to recognise and understand what drives high and low ESG performance among their holdings, informing their approach to engagement (for instance, by reflecting the impact of “trashgate” in their decision-making process for H&M). They also supplement brand and reputational risk management, with a specific focus on sustainability issues and controversies. 

At Arabesque S-Ray, we are committed to providing innovative tools and incisive insight into ESG data to empower businesses and investors. This includes a substantial focus on applied NLP research, perfecting cutting-edge techniques and deepening sustainability expertise to provide granular insight into corporate behaviour. We are working to design the leading NLP-powered, ESG-focused tools that will transform the way investors access and use social and traditional media signals in sustainable investing and aid responsible business.  


[1] – Aula, P. (2010). “Social media, reputation risk and ambient publicity management”. Strategy & Leadership, Vol. 38, Iss. 6, pp. 43–49.

[2] – “Reputational risk in digital publicity”, presented at the Viestinnän tutkimuksen päivät, February 12th, 2010, Tampere, Helsinki.

[3] – Laaksonen SM. Hybrid narratives: Organizational reputation in the hybrid media system. Publications of the Faculty of Social Sciences. 2017 Jun 16.

[4] – Lev, B., & Zarowin, P. (1999). The boundaries of financial reporting and how to extend them. Journal of Accounting Research, 37(2), 353-385.

[5] – Clark, Gordon L. and Feiner, Andreas and Viehs, Michael, From the Stockholder to the Stakeholder: How Sustainability Can Drive Financial Outperformance (March 5, 2015)

[6] – Friede, G., T. Busch, and A. Bassen. 2015. “ESG and Financial Performance: Aggregated Evidence from More Than 2000 Empirical Studies.” Journal of Sustainable Finance & Investment 5 (4): 210–233

[7] – Park, Andrew & Ravenel, Curtis. (2013). Integrating Sustainability Into Capital Markets: Bloomberg LP And ESG’s Quantitative Legitimacy. Journal of Applied Corporate Finance

[8] – Henriksson, R., J. Livnat, P. Pfeifer, and M. Stump. 2019. “Integrating ESG in Portfolio Construction.” The Journal of Portfolio Management 45 (4): 67–81.

[9] – Lee Y. H., W. J. Tsao, and T. H. Chu. “Use of Ontology to Support Concept-Based Text Categorization.” In Designing E-Business Systems. Markets, Services, and Networks, edited by C. Weinhardt, S. Luckner, and J. Stößer, 201-213. WEB 2008. Lecture Notes in Business Information Processing, vol 22. Berlin, Heidelberg: Springer. 2009.

[10] – Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

[11] – https://www.raconteur.net/finance/investing/how-machine-learning-is-helping-investors-find-esg-stocks/

[12] – Building Machine Learning Systems for Automated ESG Scoring, Alik Sokolov, Jonathan Mostovoy, Jack Ding, Luis Seco, The Journal of Impact and ESG Investing Jan 2021