
FDD Feature Engineering

Created: Nov 8, 2021
Last Touched: Nov 8, 2021
What is this Squad for?
This workstream's primary mandate is to develop a robust set of machine readable features that can be easily procured and inserted into Gitcoin's anti-sybil machine learning classification engine. As we learn more about what features are predictive for the classification task, this squad's goal will be to continually learn more about the flagging process to further refine and optimize the feature set used within the framework.
Short Term Goals
Medium Term Goals
Long Term Goals

Progress
Prototype data procurement from Github, Metabase and other repositories
The Omni Analytics team (Lawrence, Eric, and Yogesh) has begun replicating the "Github Scraping" portion of the BlockScience ML workflow. Based on our current understanding of the scrape, three features are being extracted from Github users (see the sketch after this list):
  • Update distance (we believe this is the time since the profile was last updated)
  • Contrib count (number of contributions)
  • Bio length (characters in their bio)
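As a rough illustration of how these three features might be computed from a raw user record, here is a minimal sketch. Field names follow the public Github API user object; the exact definitions BlockScience uses (particularly for update distance) are still our interpretation, and the contribution count is assumed to come from a separate scrape since the REST user object does not expose it.

```python
from datetime import datetime, timezone

def extract_basic_features(user: dict, contrib_count: int) -> dict:
    """Derive the three candidate features from a Github user object.

    `user` is the JSON dict returned by GET /users/{username};
    `contrib_count` is assumed to be scraped separately, since the
    REST user object does not include it.
    """
    # "Update distance": our interpretation is days since the profile was last updated.
    updated_at = datetime.fromisoformat(user["updated_at"].replace("Z", "+00:00"))
    update_distance_days = (datetime.now(timezone.utc) - updated_at).days

    return {
        "update_distance": update_distance_days,
        "contrib_count": contrib_count,
        "bio_length": len(user.get("bio") or ""),  # 0 when the bio is empty or missing
    }
```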
 
After applying for Github API access and poking around the endpoints, we identified that it is possible to collect data on users, user repos, and a user's followers. Below is a screen cap of these three categories as currently processed through our scraping script.
[Screenshot: sample output from the scraping script for users, repos, and followers]
There is a substantial amount of data offered through the Github API, but we've curated it down to the following set of items: login (Github handle), id, node_id, gravatar_id, url, type, site_admin, name, company, blog, location, hireable, public_repos, public_gists, followers, following, created_at_date, updated_at_date, bio.
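For reference, here is a hedged sketch of what this curation step could look like against the user endpoint, using the `requests` library. The field list mirrors the one above; note that the API's own names are `site_admin`, `created_at`, and `updated_at`, which we rename downstream.

```python
import requests

# Fields we keep from the Github user object (API names; created_at/updated_at
# are renamed to created_at_date/updated_at_date downstream).
CURATED_FIELDS = [
    "login", "id", "node_id", "gravatar_id", "url", "type", "site_admin",
    "name", "company", "blog", "location", "hireable", "public_repos",
    "public_gists", "followers", "following", "created_at", "updated_at", "bio",
]

def fetch_user_record(username: str, token: str) -> dict:
    """Pull a user's profile from the Github API and keep only the curated fields."""
    headers = {
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github+json",
    }
    resp = requests.get(f"https://api.github.com/users/{username}", headers=headers)
    resp.raise_for_status()
    user = resp.json()
    return {field: user.get(field) for field in CURATED_FIELDS}
```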
As mentioned, we also have stats on the repos maintained by a specific user.
[Screenshot: repository statistics for a sampled user]
As well as statistics on their activity.
[Screenshot: activity statistics for a sampled user]
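A similar sketch for the repo and activity pulls, assuming the standard `/users/{username}/repos` and `/users/{username}/events/public` endpoints; the summary statistics below are illustrative stand-ins for whatever the screen caps above report.

```python
import requests

def fetch_repo_and_activity_stats(username: str, token: str) -> dict:
    """Summarise a user's repositories and recent public activity."""
    headers = {
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github+json",
    }
    repos = requests.get(
        f"https://api.github.com/users/{username}/repos",
        headers=headers, params={"per_page": 100},
    ).json()
    events = requests.get(
        f"https://api.github.com/users/{username}/events/public",
        headers=headers, params={"per_page": 100},
    ).json()

    return {
        "repo_count": len(repos),
        "total_stars": sum(r.get("stargazers_count", 0) for r in repos),
        "total_forks": sum(r.get("forks_count", 0) for r in repos),
        "recent_public_events": len(events),  # capped at the most recent 100 events
    }
```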
We believe this is a solid first pass at collecting the raw inputs required to build a set of features useful for discriminating between sybil and legitimate accounts. As a high-level example, the number of repos plus the number of stars on those repos could indicate a real user, one who is active in developing open source. An individual with this profile might be unlikely to be a sybil, given the amount of effort required to create an account with such an extensive online presence.
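To make that argument concrete, a toy scoring function along these lines might look as follows; the thresholds are purely illustrative placeholders, not tuned values.

```python
def open_source_presence_score(stats: dict) -> float:
    """Toy heuristic: weight repo count and accumulated stars as a proxy for
    an established open-source presence. Values near 1.0 suggest the effort
    behind the account makes a sybil less likely. Thresholds are illustrative."""
    repo_signal = min(stats.get("repo_count", 0) / 10, 1.0)   # saturates at 10 repos
    star_signal = min(stats.get("total_stars", 0) / 50, 1.0)  # saturates at 50 stars
    return 0.5 * repo_signal + 0.5 * star_signal
```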
The code is hosted in a private repository here.
Considerations
Evaluation
  • Select and transform variables/features to create a predictive model. Prepare, extract, and improve the data ("clean data, please"):
    ◦ Handle missing data (N/A values)
    ◦ Handle continuous data and features before training the model
    ◦ Handle categorical features: convert non-numerical values to integers or floats via label/tag encoding or one-hot encoding (using libraries); see the preprocessing sketch after this list
  • Feature Selection - Decision Tree, CNN, Backpropagation
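As referenced above, a minimal preprocessing sketch with pandas. The column names are assumptions based on the curated Github fields, and `pd.get_dummies` is used for one-hot encoding purely for illustration.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Handle missing values and encode categorical columns before training."""
    df = df.copy()

    # Continuous columns: fill missing (N/A) values with the column median.
    numeric_cols = ["update_distance", "contrib_count", "bio_length",
                    "public_repos", "followers", "following"]
    for col in numeric_cols:
        df[col] = df[col].fillna(df[col].median())

    # Categorical columns: fill missing values, then one-hot encode.
    categorical_cols = ["type", "hireable"]
    df[categorical_cols] = df[categorical_cols].fillna("unknown").astype(str)
    encoded = pd.get_dummies(df[categorical_cols])

    return pd.concat([df[numeric_cols], encoded], axis=1)
```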
Expandable Feature Engineering
  • Build a prototype with Python
  • Hold out a validation set (see the sketch after this list)
  • Guard against overfitting
  • Automated feature engineering
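A hedged sketch of the validation step in Python, assuming a feature dataframe and a sybil/not-sybil label column; the random forest is a placeholder classifier, not the model the engine ultimately uses.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

def train_prototype(features: pd.DataFrame, labels: pd.Series) -> RandomForestClassifier:
    """Hold out a validation set and compare train/validation scores for overfitting."""
    X_train, X_val, y_train, y_val = train_test_split(
        features, labels, test_size=0.2, stratify=labels, random_state=42
    )

    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)

    # A large gap between these two scores is a sign of overfitting.
    print("train accuracy:     ", model.score(X_train, y_train))
    print("validation accuracy:", model.score(X_val, y_val))
    print(classification_report(y_val, model.predict(X_val)))
    return model
```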