Circles Network Data Analysis Report
Sep 26, 2023
Verifiable Machine Learning can play a key role in both extending and enhancing functionalities of social graphs through the extrapolation of trust levels, anomaly detection, and integrity assurance. While industries such as social media, finance, and e-commerce widely utilize these approaches, analogous capabilities in Web3 are scarce. As part of a research partnership with CirclesUBI, we at Giza explore the feasibility of meaningful data analysis, model training, and trust score generation.
Here, we introduce the concept of Universal Basic Income (UBI) and the Circles protocol, the challenges in building a robust trust network, and how machine learning can help. We discuss Giza's model development based on the Circles Database, review the methodologies employed for data preprocessing, and wrap up with insights and final observations from this technical analysis.
Circles UBI & Giza
UBI is a long-standing socio-economic policy framework that guarantees every citizen a regular, unconditional stipend irrespective of employment, income, or social standing. Circles aims to be the leading blockchain-native UBI infrastructure, whereby each participant is endowed with a unique, algorithmically minted personal currency connected through a trust-based social graph, namely a Web-of-Trust (WoT). Although WoT can be an effective peer-to-peer trust management system for a non-bureaucratic decentralized identity framework, it has its inherent weaknesses. These include challenges around establishment of initial trust, decay of extant trust relationships, and Sybil attack risks.
In order to address these, Giza and Circles are researching ways in which verifiable machine learning, in particular algorithmic trust scoring, can enhance the UBI network’s resilience against Sybil threats. This report is the first analytical output from this collaboration, highlighting initial observations derived from data-driven explorations.
Technical Analysis Summary
With the objective of identifying potentially malicious agents and fake accounts in the Circles Network using machine learning, Giza has conducted a network-based data analysis. The project follows a fairly conventional sequence of preliminary steps: data collection and initial exploration, followed by data cleaning, preprocessing, and feature extraction.
Data Collection & Preprocessing
The network data on users comes primarily from the Circles Blockchain Indexer, supplemented by small-scale labeled datasets of previously detected fake accounts provided by the Circles team. Historic network interactions such as transactions and trust events were collated from the Indexer database for further analysis. After removing outliers and cleaning the data, three additional features were added:
1. Trusted/Trusting Ratio: The ratio of Inwards/Outwards Trust. Used in Weighted Trusted/Trusting Ratio formula.
2. Weighted Trusted/Trusting Score
Formula for Weighted Trusted/Trusting Ratio, where w is the ratio of Inwards/Outwards Trust and x is the total number of Trusted/Trusting accounts:
$$ f(w, x) = \sqrt{x} \cdot \frac{1}{1 + \left|\log(w)\right|} $$
• The score takes the square root of the total number of trusted and trusting accounts and weights it by the Trusted/Trusting Ratio calculated beforehand.
• Score is maximized for a Trusted/Trusting Ratio of 1, and symmetric around 1 ( f(0.5,x) = f(2,x) ).
• This score is meant to reflect the “popularity” of a user in the network, with an emphasis on users who have a “healthy” ratio of people who trust them, in comparison to people whom they trust.
3. Token Distribution Ratio: The share of a user's tokens currently held in the balances of accounts other than the token's owner. The higher the Token Distribution Ratio, the more integrated the token (and consequently the owner) is in the network.
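The three features above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study. The function names and the `balances` mapping are hypothetical, the logarithm is assumed to be the natural log (the formula does not state a base), and the Token Distribution Ratio is interpreted here as the fraction of a token's supply held by accounts other than its owner.

```python
import math

def trust_ratio(inward: int, outward: int) -> float:
    """Trusted/Trusting Ratio: inward trust count over outward trust count.

    Both counts must be positive; how zero counts were handled in the
    study is not stated, so this sketch simply rejects them.
    """
    if inward <= 0 or outward <= 0:
        raise ValueError("both trust counts must be positive")
    return inward / outward

def weighted_score(w: float, x: int) -> float:
    """f(w, x) = sqrt(x) * 1 / (1 + |log(w)|), natural log assumed."""
    return math.sqrt(x) / (1.0 + abs(math.log(w)))

def token_distribution_ratio(balances: dict, owner: str) -> float:
    """Fraction of a token's supply held by accounts other than its owner.

    `balances` maps holder -> amount of the owner's token (hypothetical layout).
    """
    total = sum(balances.values())
    if total == 0:
        return 0.0
    held_by_others = sum(amt for holder, amt in balances.items() if holder != owner)
    return held_by_others / total

# A ratio of exactly 1 leaves sqrt(x) unpenalized.
print(weighted_score(1.0, 16))  # → 4.0
```

Because |log(w)| = |log(1/w)|, a user with twice as many inward as outward trust connections receives the same penalty as a user with half as many, which is the symmetry around 1 described above.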
During the initial data analysis, various observations were made regarding the usage of the network's features and the general distribution of network participants.
A key observation was that out of ≈120000 signed-up accounts, only ≈2500 had actively interacted within the network in the 3-month period immediately prior to the analysis, starting from 01/04/2023. This heavily skews the analysis, since roughly 98% of accounts sit at the minimum value of every network participation metric and therefore provide no insight into the behavioral differences between real and fake users. To separate active from inactive users, a filtered dataset was created containing only participants who had interacted within the network in the last 3 months.
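The activity filter can be sketched as follows. This is an illustrative assumption about how such a filter might look, not the study's code: the `last_interaction` mapping and the function name are hypothetical, and the 3-month window is approximated as 90 days.

```python
from datetime import datetime, timedelta

def filter_active(last_interaction: dict, analysis_date: datetime,
                  window_days: int = 90) -> set:
    """Return accounts whose last interaction falls inside the window.

    `last_interaction` maps account -> datetime of the most recent
    transaction or trust event, or None if the account never interacted.
    """
    cutoff = analysis_date - timedelta(days=window_days)
    return {
        account
        for account, ts in last_interaction.items()
        if ts is not None and ts >= cutoff
    }
```

Applied to the full account list, a filter of this kind would reduce the ≈120000 accounts to the ≈2500 active ones used in the filtered dataset.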
Figure 1 shows how the distribution of the weighted Trusted/Trusting ratio score differs between the unfiltered and the filtered dataset. In the unfiltered dataset, a larger proportion of network participants fall either below or above the ideal score (Trusted/Trusting ratio of 1), which is marked by the blue line.
Secondly, the “untrust” feature, where a user lowers the token limit in a trust relationship from a positive value back to zero, is heavily underutilized: only about 1% of accounts have ever untrusted another account. The lack of such data is significant, as untrust events would have provided valuable insight for identifying fake or fraudulent accounts.
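Detecting untrust events from a trust event log can be sketched as below. The event layout (truster, trustee, limit), listed in chronological order, is a hypothetical simplification and not the Indexer's actual schema.

```python
def find_untrust_events(events):
    """Return (truster, trustee) pairs whose limit dropped from > 0 to 0.

    `events` is a chronological list of (truster, trustee, limit) tuples.
    """
    current_limit = {}
    untrusts = []
    for truster, trustee, limit in events:
        key = (truster, trustee)
        # An untrust is a transition from a positive limit back to zero.
        if current_limit.get(key, 0) > 0 and limit == 0:
            untrusts.append(key)
        current_limit[key] = limit
    return untrusts

def untrusting_share(events, all_accounts):
    """Fraction of accounts that have ever untrusted another account."""
    untrusters = {truster for truster, _ in find_untrust_events(events)}
    return len(untrusters) / len(all_accounts)
```

A zero limit that was never preceded by a positive one is not counted, since no trust relationship existed to revoke.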
As of now, there is no sufficiently large dataset of labeled fake and real accounts to support supervised learning algorithms. As a result, Giza has decided to approach this project as an unsupervised learning task, experimenting with various algorithms on both datasets: the unfiltered dataset of 120000 accounts, and a filtered version that contains 2500 active users.
With both datasets, the following approaches have been used:
• Clustering algorithms that look for distinct and collective behavioral patterns, used to identify fake or malicious accounts in the network.
• Recommender algorithms that predict and recommend potential friends based on individuals’ presence in the community structure. Recommender algorithms stack on top of community algorithms, which use network graphs to identify emergent communities in the network and sort isolated participants.
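The way a recommender can stack on top of a community step can be sketched as follows. The report does not name the specific algorithms used, so this example substitutes deliberately simple stand-ins: connected components in place of a community detection algorithm, and common-neighbor counting in place of a trained recommender. Trust edges are treated as undirected, which is also an assumption.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Undirected adjacency sets from (account, account) trust pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def connected_components(adj):
    """Stand-in community step: each connected component is a community."""
    seen, communities = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            members.add(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        communities.append(members)
    return communities

def recommend(adj, community, user, top_n=3):
    """Rank non-neighbors in the user's community by shared connections."""
    scores = {}
    for candidate in community:
        if candidate == user or candidate in adj[user]:
            continue
        common = len(adj[user] & adj[candidate])
        if common:
            scores[candidate] = common
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]
```

Restricting candidates to the user's own community is what makes the recommender depend on the community step. Accounts that end up in very small components are the isolated participants mentioned above.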
For the unfiltered dataset, both approaches predictably failed to provide satisfactory results. The 98% inactive user ratio diluted the distribution of features for the aforementioned algorithms, and no statistically significant outcome was observed, other than reinforcing the existence of active and inactive users as two distinct account types in the network.
For the filtered dataset, the network-graph-based approach provided salient outcomes. In particular, the analysis identified a large, connected community of users, which the Circles team believes to be located in Berlin, the geographical epicenter of the network’s community, as well as various smaller communities. Clustering approaches were less successful, which is expected given the small sample size of the dataset.
The topology of the trust network in Circles is relatively scale-free, meaning the degree distribution of its nodes approximately follows a power law, a scaling relationship often summarized informally as the 80:20 rule. Scale-free topologies are very common in social networks, where a few highly connected individuals are linked to significantly more people than the median. These individuals are indicated by the light-colored dots in Figure 2. The number of followers on social media accounts also typically exhibits this relationship.
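A rough first check for this kind of skew is to look at the degree histogram and at the share of all connections held by the best-connected 20% of nodes. The sketch below is a simple diagnostic under the assumption of an undirected edge list. It is not a formal power-law fit, and the function names are illustrative.

```python
from collections import Counter

def degrees(edges):
    """Degree of each node in an undirected edge list."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg

def degree_histogram(deg):
    """Map degree -> number of nodes with that degree."""
    return Counter(deg.values())

def top_share(deg, fraction=0.2):
    """Share of all edge endpoints held by the top `fraction` of nodes."""
    ranked = sorted(deg.values(), reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)
```

In a heavily skewed network, `top_share` is far above the 0.2 that an even distribution of connections would give, and the histogram shows many low-degree nodes alongside a few very high-degree ones.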
Specific to the Circles Network is the observation that these individuals also act as intermediaries in token transfers, i.e. transitive exchange. In essence, being trusted by one of the 'well-connected' individuals on the Circles Network opens up the possibility to exchange tokens with not only those individuals but also their trustors, which provides an additional layer of incentives for less-connected people to gain the trust of well-connected individuals. It is difficult to isolate the consequences of this additional economic incentive on the topology of the network graph, but it is an interesting phenomenon that warrants further exploration.
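The reach gained through a well-connected trustor can be illustrated with a short breadth-first search over trust edges. This is a simplified sketch: it assumes edges are given as hypothetical (truster, trustee) pairs and ignores trust limits and token balances, which constrain real transfers.

```python
from collections import defaultdict, deque

def trusters_of(edges):
    """Map each account to the set of accounts that trust it."""
    rev = defaultdict(set)
    for truster, trustee in edges:
        rev[trustee].add(truster)
    return rev

def reachable_counterparties(edges, user, max_hops=2):
    """Accounts the user could reach by hopping from trustee to truster.

    Hop 1 is the user's direct trustors; hop 2 is the trustors of those
    trustors, reached through them as intermediaries.
    """
    rev = trusters_of(edges)
    seen = {user}
    reached = set()
    frontier = deque([(user, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for truster in rev.get(node, ()):
            if truster not in seen:
                seen.add(truster)
                reached.add(truster)
                frontier.append((truster, hops + 1))
    return reached
```

In this simplified model, a single trust connection from a hub with many trustors adds all of those trustors to the user's reachable set at once, which is the incentive described above.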
Although they present some interesting patterns about the network, at present both machine learning models trained for this study fail to accomplish the goal of identifying fake/fraudulent accounts consistently and accurately. This is in large part due to the lack of data available. In expectation of increased economic incentives for active participation with the ‘Group Currencies’ feature coming in the following months, it is reasonable to expect that the network will attract more users, of both genuine and malicious types. This increased user base is highly likely to make both clustering approaches and network-based models more performant and accurate.
Comments and insights are highly valued at Giza. For research enquiries and contributions please join the Discord and follow the #research channel or reach out to firstname.lastname@example.org.