
FDD Feature Engineering

Created: Nov 8, 2021
Last Touched: Nov 8, 2021
What is this Squad for?
This workstream's primary mandate is to develop a robust set of machine readable features that can be easily procured and inserted into Gitcoin's anti-sybil machine learning classification engine. As we learn more about what features are predictive for the classification task, this squad's goal will be to continually learn more about the flagging process to further refine and optimize the feature set used within the framework.
Short Term Goals
Medium Term Goals
Long Term Goals

Progress
Prototype data procurement from Github, Metabase and other repositories
The Omni Analytics team (Lawrence, Eric, and Yogesh) has begun replicating the "Github Scraping" portion of the BlockScience ML workflow. Based on our current understanding of the scrape, three features are being extracted from Github users (see the sketch after this list):
  • Update distance (we believe this is the time since the profile was last updated)
  • Contrib count (number of contributions)
  • Bio length (characters in their bio)
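As a rough illustration of how these three features might be computed from a raw user record, here is a minimal sketch. Field names follow the public Github API user object; the exact definitions BlockScience uses (particularly for update distance) are still our interpretation, and the contribution count is assumed to come from a separate scrape since the REST user object does not expose it.

```python
from datetime import datetime, timezone

def extract_basic_features(user: dict, contrib_count: int) -> dict:
    """Derive the three candidate features from a Github user object.

    `user` is the JSON dict returned by GET /users/{username};
    `contrib_count` is assumed to be scraped separately, since the
    REST user object does not include it.
    """
    # "Update distance": our interpretation is days since the profile was last updated.
    updated_at = datetime.fromisoformat(user["updated_at"].replace("Z", "+00:00"))
    update_distance_days = (datetime.now(timezone.utc) - updated_at).days

    return {
        "update_distance": update_distance_days,
        "contrib_count": contrib_count,
        "bio_length": len(user.get("bio") or ""),  # 0 when the bio is empty or missing
    }
```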
 
After applying for Github API access and poking around the endpoints, we identified that it is possible to collect data on users, user repos, and a user's followers. Below is a screen cap of these three categories as currently processed through our scraping script.
[Screenshot: sample output from the scraping script for users, repos, and followers]
There is a substantial amount of data offered through the Github API, but we've curated it down to the following set of items: login (Github handle), id, node_id, gravatar_id, url, type, site_admin, name, company, blog, location, hireable, public_repos, public_gists, followers, following, created_at_date, updated_at_date, bio.
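For reference, here is a hedged sketch of what this curation step could look like against the user endpoint, using the `requests` library. The field list mirrors the one above; note that the API's own names are `site_admin`, `created_at`, and `updated_at`, which we rename downstream.

```python
import requests

# Fields we keep from the Github user object (API names; created_at/updated_at
# are renamed to created_at_date/updated_at_date downstream).
CURATED_FIELDS = [
    "login", "id", "node_id", "gravatar_id", "url", "type", "site_admin",
    "name", "company", "blog", "location", "hireable", "public_repos",
    "public_gists", "followers", "following", "created_at", "updated_at", "bio",
]

def fetch_user_record(username: str, token: str) -> dict:
    """Pull a user's profile from the Github API and keep only the curated fields."""
    headers = {
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github+json",
    }
    resp = requests.get(f"https://api.github.com/users/{username}", headers=headers)
    resp.raise_for_status()
    user = resp.json()
    return {field: user.get(field) for field in CURATED_FIELDS}
```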
As mentioned, we also have stats on the repos maintained by a specific user.
[Screenshot: repository statistics for a sampled user]
As well as statistics on their activity.
[Screenshot: activity statistics for a sampled user]
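A similar sketch for the repo and activity pulls, assuming the standard `/users/{username}/repos` and `/users/{username}/events/public` endpoints; the summary statistics below are illustrative stand-ins for whatever the screen caps above report.

```python
import requests

def fetch_repo_and_activity_stats(username: str, token: str) -> dict:
    """Summarise a user's repositories and recent public activity."""
    headers = {
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github+json",
    }
    repos = requests.get(
        f"https://api.github.com/users/{username}/repos",
        headers=headers, params={"per_page": 100},
    ).json()
    events = requests.get(
        f"https://api.github.com/users/{username}/events/public",
        headers=headers, params={"per_page": 100},
    ).json()

    return {
        "repo_count": len(repos),
        "total_stars": sum(r.get("stargazers_count", 0) for r in repos),
        "total_forks": sum(r.get("forks_count", 0) for r in repos),
        "recent_public_events": len(events),  # capped at the most recent 100 events
    }
```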
We believe this is a solid first pass at collecting the raw inputs required to build a set of features useful for discriminating between sybil and legitimate accounts. As a high-level example, the number of repos plus the number of stars on those repos could indicate a real user, one who is active in developing open source. An individual with this profile might be unlikely to be a sybil, given the amount of effort required to create an account with such an extensive online presence.
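To make that argument concrete, a toy scoring function along these lines might look as follows; the thresholds are purely illustrative placeholders, not tuned values.

```python
def open_source_presence_score(stats: dict) -> float:
    """Toy heuristic: weight repo count and accumulated stars as a proxy for
    an established open-source presence. Values near 1.0 suggest the effort
    behind the account makes a sybil less likely. Thresholds are illustrative."""
    repo_signal = min(stats.get("repo_count", 0) / 10, 1.0)   # saturates at 10 repos
    star_signal = min(stats.get("total_stars", 0) / 50, 1.0)  # saturates at 50 stars
    return 0.5 * repo_signal + 0.5 * star_signal
```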
The code is hosted in a private repository here.
Considerations
Evaluation
  • Select and transform variables/features to create a predictive model. Prepare, extract, and improve the data ("clean data, please"):
    ◦ Handle missing data (N/A values)
    ◦ Handle continuous data and features before training the model
    ◦ Handle categorical features: convert non-numerical values to integers or floats via label/tag encoding or one-hot encoding (using libraries); see the preprocessing sketch after this list
  • Feature Selection - Decision Tree, CNN, Backpropagation
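As referenced above, a minimal preprocessing sketch with pandas. The column names are assumptions based on the curated Github fields, and `pd.get_dummies` is used for one-hot encoding purely for illustration.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Handle missing values and encode categorical columns before training."""
    df = df.copy()

    # Continuous columns: fill missing (N/A) values with the column median.
    numeric_cols = ["update_distance", "contrib_count", "bio_length",
                    "public_repos", "followers", "following"]
    for col in numeric_cols:
        df[col] = df[col].fillna(df[col].median())

    # Categorical columns: fill missing values, then one-hot encode.
    categorical_cols = ["type", "hireable"]
    df[categorical_cols] = df[categorical_cols].fillna("unknown").astype(str)
    encoded = pd.get_dummies(df[categorical_cols])

    return pd.concat([df[numeric_cols], encoded], axis=1)
```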
Expandable Feature Engineering
  • Build a prototype with Python
  • Hold out a validation set (see the sketch after this list)
  • Guard against overfitting
  • Automated feature engineering
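A hedged sketch of the validation step in Python, assuming a feature dataframe and a sybil/not-sybil label column; the random forest is a placeholder classifier, not the model the engine ultimately uses.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

def train_prototype(features: pd.DataFrame, labels: pd.Series) -> RandomForestClassifier:
    """Hold out a validation set and compare train/validation scores for overfitting."""
    X_train, X_val, y_train, y_val = train_test_split(
        features, labels, test_size=0.2, stratify=labels, random_state=42
    )

    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)

    # A large gap between these two scores is a sign of overfitting.
    print("train accuracy:     ", model.score(X_train, y_train))
    print("validation accuracy:", model.score(X_val, y_val))
    print(classification_report(y_val, model.predict(X_val)))
    return model
```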