YouTube Data Analysis Project
This project analyzes global YouTube data to understand trends in plant breeding content. The dataset covers 8,971 English-language videos with 2.1 billion views, collected for 198 keywords in 3 categories of concepts (Old, Current, Modern).
Problem Statement
Many SEO tools can analyze keyword trends, but they are not efficient for analyzing trends on a broader horizon, where multiple categories of keywords and the audience's interaction with them are examined simultaneously. We wanted to do this using Python and the YouTube Data API. As a student of Crop Science (Plant Breeding and Seed Science), together with my group member, Samaneh Javidian, I wanted to analyze the following:
- Understanding broad trends and the impact of plant breeding research over time.
- Analyzing how the YouTube audience interacts with plant breeding videos across the 3 categories of concepts.
- Analyzing the content gap in plant breeding related videos.
- Identifying emerging topics and assessing future directions in plant breeding research.
Methodology
- Collect data from YouTube using the YouTube API based on specific plant breeding keywords.
- Save the data, grouped into the 3 categories, in a structured CSV file for further analysis. The following fields are saved:
- Keyword: 66 keywords in each category (total 66 x 3 = 198 keywords).
- Category: Old, Current, Modern
- Video ID
- Video Title
- Published Date
- Duration of the video
- View Count
- Like Count
- Comment Count
- Analyze the collected data to determine trends in viewership and engagement.
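The saved fields listed above could be written to CSV roughly as follows. This is a minimal sketch, not the project's actual writer: the column names `keyword`, `category`, `video_id`, and `title` are assumptions chosen for illustration, and the sample row is made up.

```python
import csv
import io

# Column order for the output file. The names are illustrative assumptions,
# not necessarily those used in youtube_data/all_videos_data.csv.
FIELDNAMES = [
    "keyword", "category", "video_id", "title", "published_date",
    "duration_seconds", "view_count", "like_count", "comment_count",
]

def write_rows(rows, handle):
    """Write video records (dicts keyed by FIELDNAMES) to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

# Made-up record, only to show the shape of one row.
sample = {
    "keyword": "Mass selection in plant breeding",
    "category": "Old",
    "video_id": "abc123",
    "title": "Mass selection explained",
    "published_date": "2016-09-29T22:39:48Z",
    "duration_seconds": 615,
    "view_count": 1200,
    "like_count": 45,
    "comment_count": 3,
}

buffer = io.StringIO()  # stands in for a real file opened with newline=""
write_rows([sample], buffer)
csv_text = buffer.getvalue()
```

Writing to a real file would only require replacing the in-memory buffer with `open(path, "w", newline="", encoding="utf-8")`.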
Technologies Used
- API and Data Handling:
- YouTube Data API v3
- Python libraries: pandas, numpy
- Visualization and Analysis:
- matplotlib, seaborn, plotly, tabulate
- Machine Learning and NLP:
- NLTK, BERT
- Data Storage and Preprocessing:
- CSV for structured data storage
Keyword Classification
We used an analytical approach to classify keywords into 3 categories based on historical relevance, technological advancement, and future research trends in plant breeding.
1. Old (Traditional Methods): Keywords in the Old category were derived from practices dominant before the 20th century, emphasizing phenotype-based selection and indigenous approaches.
- These keywords are rooted in the practices before the advent of molecular biology.
- Focuses on phenotypic selection, landrace improvement, and methods relying heavily on field observation and generational improvement.
- Emphasizes low-technology, empirical, and community-based methods.
2. Current (Established/Conventional Methods): Keywords in the Current category represent methods broadly adopted in the late 20th century and still in use today, often integrating molecular biology for improved efficiency.
- Represent technologies widely used today, incorporating molecular biology but not the most advanced techniques.
- Includes marker-assisted selection (MAS), QTL mapping, and heterosis exploitation.
- These methods aim to balance efficiency and accessibility for commercial and academic breeding.
3. Modern (Cutting-Edge/Future Methods): Keywords in the Modern category highlight cutting-edge and emerging areas driven by computational biology, AI, and genomics.
- Reflect cutting-edge technologies that use computational tools, CRISPR, and omics sciences.
- These methods are either newly implemented or have potential for future adoption.
- Focuses on precision, automation, and integrating bioinformatics for decision-making.
Keywords
Old Plant Breeding (66 Keywords)
- Traditional selective breeding in plant breeding
- Landrace selection in plant breeding
- Phenotypic selection in plant breeding
- Pedigree breeding in plant breeding
- Mass selection in plant breeding
- Clonal selection in plant breeding
- Family selection in plant breeding
- Open pollination in plant breeding
- Hybridization techniques in plant breeding
- Genetic diversity preservation in plant breeding
- Crop domestication in plant breeding
- Traditional farmer selection in plant breeding
- Indigenous breeding methods in plant breeding
- Germplasm conservation in plant breeding
- Heritage crop breeding in plant breeding
- Regional adaptation breeding in plant breeding
- Cross-pollination techniques in plant breeding
- Seed saving practices in plant breeding
- Ancestral crop improvement in plant breeding
- Wild relative breeding in plant breeding
- Traditional varietal selection in plant breeding
- Local adaptation breeding in plant breeding
- Empirical crop selection in plant breeding
- Population improvement in plant breeding
- Traditional genetic diversity in plant breeding
- Manual trait selection in plant breeding
- Phenotype-based breeding in plant breeding
- Generational crop improvement in plant breeding
- Agronomic performance selection in plant breeding
- Community-based breeding in plant breeding
- Geographic adaptation in plant breeding
- Reproductive isolation techniques in plant breeding
- Evolutionary breeding approaches in plant breeding
- Reciprocal recurrent selection in plant breeding
- Line breeding in plant breeding
- Composite crossing in plant breeding
- Mass-pedigree method in plant breeding
- Diallel crossing in plant breeding
- Synthetic variety breeding in plant breeding
- Bulk population method in plant breeding
- Traditional crop rotation effects in plant breeding
- Monoculture improvement in plant breeding
- Natural trait segregation in plant breeding
- Seedbed selection in plant breeding
- Pure-line selection in plant breeding
- Hardy-Weinberg principle in plant breeding
- Traditional soil suitability tests in plant breeding
- Indigenous pest resistance breeding in plant breeding
- Farmer-participatory breeding in plant breeding
- In situ conservation breeding in plant breeding
- Traditional field trials in plant breeding
- Manual self-pollination techniques in plant breeding
- Natural outcrossing studies in plant breeding
- Selective pressure analysis in plant breeding
- Regional cropping system adaptation in plant breeding
- Propagation techniques in plant breeding
- Ancient grain breeding in plant breeding
- Local seed exchange practices in plant breeding
- Historical agronomic practices in plant breeding
- Subsistence crop breeding in plant breeding
- Folk selection methods in plant breeding
- Landrace improvement cycles in plant breeding
- Regional biotic stress breeding in plant breeding
- Crop resilience through generations in plant breeding
- Traditional soil fertility management in plant breeding
- Traditional drought tolerance breeding in plant breeding
Current (Established/Conventional) Plant Breeding (66 Keywords)
- Marker-assisted selection in plant breeding
- Quantitative trait loci (QTL) mapping in plant breeding
- Hybrid vigor breeding in plant breeding
- Backcross breeding in plant breeding
- Recurrent selection in plant breeding
- Multiline breeding in plant breeding
- Heterosis breeding in plant breeding
- Advanced generation breeding in plant breeding
- Genome-wide selection in plant breeding
- Phenotypic screening in plant breeding
- Molecular marker development in plant breeding
- Genetic diversity analysis in plant breeding
- Performance testing in plant breeding
- Trait introgression in plant breeding
- Advanced breeding lines in plant breeding
- Genetic variability assessment in plant breeding
- Progeny testing in plant breeding
- Breeding value estimation in plant breeding
- Population genetics in plant breeding
- Crop improvement strategies in plant breeding
- Genetic gain prediction in plant breeding
- Breeding program design in plant breeding
- Genetic uniformity in plant breeding
- Controlled pollination in plant breeding
- Genetic resource management in plant breeding
- Phenotypic plasticity in plant breeding
- Breeding efficiency in plant breeding
- Genetic complexity analysis in plant breeding
- Reproductive biology in plant breeding
- Crop adaptation mechanisms in plant breeding
- Statistical genetics in plant breeding
- Breeding methodology optimization in plant breeding
- Classical resistance breeding in plant breeding
- Genomic in situ hybridization (GISH) in plant breeding
- Haploidy induction in plant breeding
- Bi-parental mapping populations in plant breeding
- Multi-parental breeding approaches in plant breeding
- Doubled haploid technology in plant breeding
- Epistasis analysis in plant breeding
- Trait stacking in plant breeding
- Transcriptomics in plant breeding
- Isozyme markers in plant breeding
- Breeding for disease resistance in plant breeding
- DArT marker technology in plant breeding
- AFLP marker systems in plant breeding
- SNP genotyping in plant breeding
- Biochemical marker-assisted breeding in plant breeding
- Marker-based recurrent selection in plant breeding
- Breeding for yield stability in plant breeding
- Abiotic stress tolerance breeding in plant breeding
- Pre-breeding programs in plant breeding
- Crop modeling integration in plant breeding
- Germplasm enhancement in plant breeding
- Controlled environment breeding in plant breeding
- Speed breeding protocols in plant breeding
- High-yield variety development in plant breeding
- Disease-resistance introgression in plant breeding
- Fusarium wilt tolerance breeding in plant breeding
- High-oil content breeding in plant breeding
- MABC (Marker-Assisted Backcrossing) in plant breeding
- Field trial optimization in plant breeding
- Regional trait evaluation in plant breeding
- Statistical QTL analysis in plant breeding
- Breeding for micronutrient density in plant breeding
- Controlled hybrid seed production in plant breeding
- Genotypic adaptability analysis in plant breeding
Modern (Cutting-Edge) Plant Breeding (66 Keywords)
- CRISPR gene editing in plant breeding
- Genome sequencing in plant breeding
- Precision breeding in plant breeding
- Synthetic biology in plant breeding
- Gene network analysis in plant breeding
- Metabolic engineering in plant breeding
- Transgenic development in plant breeding
- Next-generation sequencing in plant breeding
- Epigenetic modification in plant breeding
- Machine learning in plant breeding
- Artificial intelligence prediction in plant breeding
- RNA interference in plant breeding
- Genome editing techniques in plant breeding
- Advanced phenotyping in plant breeding
- Computational breeding in plant breeding
- Genomic selection in plant breeding
- Metabolomics in plant breeding
- Proteomics applications in plant breeding
- Systems biology in plant breeding
- Nano-biotechnology in plant breeding
- Digital phenotyping in plant breeding
- Advanced genetic transformation in plant breeding
- Molecular breeding platforms in plant breeding
- Genome-wide association studies in plant breeding
- Synthetic genomics in plant breeding
- Predictive breeding models in plant breeding
- High-throughput screening in plant breeding
- Gene regulatory network in plant breeding
- Advanced trait mapping in plant breeding
- Computational genomics in plant breeding
- Climate-resilient breeding in plant breeding
- Precision agriculture technologies in plant breeding
- Adaptive breeding strategies in plant breeding
- CRISPR-Cas9 delivery systems in plant breeding
- Pangenomics in plant breeding
- Hologenomics in plant breeding
- Digital twin modeling in plant breeding
- Epigenome-wide association studies in plant breeding
- Single-cell sequencing in plant breeding
- Multi-omics integration in plant breeding
- Deep learning for trait prediction in plant breeding
- Blockchain for seed traceability in plant breeding
- Sensor-based phenotyping in plant breeding
- Quantum computing in plant breeding
- Virtual reality applications in plant breeding
- Synthetic promoter engineering in plant breeding
- Biostimulant-driven breeding in plant breeding
- Photosynthetic efficiency optimization in plant breeding
- AI-based genome annotation in plant breeding
- Cellular reprogramming in plant breeding
- Carbon sequestration traits breeding in plant breeding
- Eco-genomics in plant breeding
- Functional metagenomics in plant breeding
- Robotic phenotyping in plant breeding
- Non-coding RNA studies in plant breeding
- Microbiome-assisted breeding in plant breeding
- Advanced chloroplast transformation in plant breeding
- Pathogenomics applications in plant breeding
- Environmental DNA (eDNA) in plant breeding
- Autonomous breeding programs in plant breeding
- Thermo-tolerance breeding in plant breeding
- Urban crop development in plant breeding
- Biodegradable agro-inputs breeding in plant breeding
- Real-time molecular diagnostics in plant breeding
- Holistic systems breeding in plant breeding
- Predictive climate modeling in plant breeding
Steps Followed
Step 1: Data Collection
- Get a YouTube Data API v3 key. This allowed us to programmatically access YouTube data.
- Search queries: Use the keyword lists described above (keywords_old, keywords_current, keywords_modern) to construct search queries. It is crucial to make these queries as specific as possible. For example, instead of just “Mass selection”, use “mass selection in plant breeding” to get more relevant results.
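As a rough illustration of the query refinement described above, the sketch below appends the domain phrase to a keyword only when it is missing. The helper `build_query` and the shortened lists are hypothetical stand-ins; only the list names `keywords_old`, `keywords_current`, and `keywords_modern` come from the project.

```python
SUFFIX = "in plant breeding"

def build_query(keyword: str) -> str:
    """Return the keyword with the domain suffix appended exactly once."""
    keyword = " ".join(keyword.split())  # collapse stray whitespace
    if keyword.lower().endswith(SUFFIX):
        return keyword
    return f"{keyword} {SUFFIX}"

# Shortened stand-ins for the real 66-item lists.
keywords_old = ["Mass selection", "Pedigree breeding in plant breeding"]
keywords_current = ["Marker-assisted selection"]
keywords_modern = ["CRISPR gene editing"]

# One list of specific search queries per category.
queries = {
    category: [build_query(k) for k in keywords]
    for category, keywords in {
        "Old": keywords_old,
        "Current": keywords_current,
        "Modern": keywords_modern,
    }.items()
}
```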
A limitation of the YouTube Data API is its quota of 10,000 units per day. We therefore had to run the script multiple times to collect all the data. To support this, we implemented a state file that tracks the script's progress across runs. This file is stored in the youtube_data directory. On each run, the script checks the state file to see whether the data for the current keyword has already been fetched. If it has, the script skips the keyword; if not, it fetches the data for that keyword. This way, we can continue from where we left off and avoid fetching the same data multiple times.
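The resume logic described above might look roughly like the following. This is a sketch under stated assumptions: the JSON layout, the `completed_keywords` key, the file name `state.json`, and the function names are all hypothetical, and `fetch` is a stand-in for the real API call.

```python
import json
import os
import tempfile

def load_state(path):
    """Return the set of keywords already fetched, or an empty set."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as fh:
        return set(json.load(fh).get("completed_keywords", []))

def save_state(path, completed):
    """Persist the completed keywords so the next run can skip them."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"completed_keywords": sorted(completed)}, fh, indent=2)

def run_once(keywords, path, fetch, max_keywords):
    """Fetch up to max_keywords pending keywords, then stop for the day."""
    completed = load_state(path)
    fetched_now = []
    for keyword in keywords:
        if keyword in completed:
            continue  # already fetched in an earlier run
        if len(fetched_now) >= max_keywords:
            break  # daily budget used up; resume on the next run
        fetch(keyword)
        completed.add(keyword)
        save_state(path, completed)  # save after every keyword
        fetched_now.append(keyword)
    return fetched_now

# Demonstration with a no-op fetch function and a temporary state file.
state_path = os.path.join(tempfile.mkdtemp(), "state.json")
all_keywords = ["kw1", "kw2", "kw3"]
first_run = run_once(all_keywords, state_path, fetch=lambda k: None, max_keywords=2)
second_run = run_once(all_keywords, state_path, fetch=lambda k: None, max_keywords=2)
```

Saving the state after every keyword, rather than once at the end, means an interrupted run loses at most the keyword that was in progress.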
Using this approach, we fetched 50 videos for each keyword, which is 50 x 66 x 3 = 9,900 videos. Why only 50 videos? Because beyond that number we found that the results were mostly irrelevant: we are analyzing educational content, and relatively few such videos are available on YouTube.
- Search request: 100 quota units
- Video details: 1 quota unit per video × 50 videos = 50 quota units
- Total per keyword: 100 + 50 = 150 quota units
- Max keywords = DAILY_QUOTA_LIMIT / quota_per_keyword = 10000 / 150 ≈ 66 keywords per day
- Max videos = 66 × 50 = 3,300 videos per day
- Days needed = Total keywords / Max keywords per day = 198 / 66 = 3 days
Therefore, we need to run the script 3 times to get all the data related to all 198 keywords stated in the keywords.csv file and store the data in the youtube_data/all_videos_data.csv file.
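The quota arithmetic above can be checked with a few lines of Python. `DAILY_QUOTA_LIMIT` and `quota_per_keyword` are the names used in the calculation; the remaining constant names are assumptions introduced for this sketch.

```python
import math

DAILY_QUOTA_LIMIT = 10_000   # quota units available per day
SEARCH_COST = 100            # units per search request
VIDEO_DETAIL_COST = 1        # units per video, as budgeted in this README
VIDEOS_PER_KEYWORD = 50
TOTAL_KEYWORDS = 198

# Cost of one keyword: one search plus details for each returned video.
quota_per_keyword = SEARCH_COST + VIDEO_DETAIL_COST * VIDEOS_PER_KEYWORD

# Whole keywords that fit in one day's quota (rounded down).
max_keywords_per_day = DAILY_QUOTA_LIMIT // quota_per_keyword
max_videos_per_day = max_keywords_per_day * VIDEOS_PER_KEYWORD

# Runs needed to cover every keyword (rounded up).
days_needed = math.ceil(TOTAL_KEYWORDS / max_keywords_per_day)
```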
We used the following resources to manage API request quotas:
- https://thepythoncode.com/article/using-youtube-api-in-python
- https://peerdh.com/blogs/programming-insights/managing-api-request-quotas-in-python?utm_source=chatgpt.com
- https://www.youtube.com/watch?v=TIZRskDMyA4
- https://realpython.com/api-integration-in-python/
Step 2: Data Cleaning & Preprocessing
- Removed duplicate entries to ensure data quality
- Converted published_date to datetime format
- Removed the ‘Z’ suffix (e.g. 2016-09-29T22:39:48Z) and converted the value to pandas datetime format
- Converted relevant columns to numeric format for analysis
- duration_seconds: Video length in seconds
- view_count: Number of views
- like_count: Number of likes
- comment_count: Number of comments
- Added a human-readable duration format for easier interpretation
- Removed extra whitespace and standardized titles
- Created an engagement rate metric, inspired by an external article, with the formula (likes + comments) / views * 100
- Removed rows in which all values are missing
- Sorted data by category and keyword for better analysis
- Saved the cleaned dataset for further analysis
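Several of the cleaning steps above can be illustrated with small standard-library helpers. This is only a sketch: the project itself works with pandas, the function names are hypothetical, and returning `None` for unparseable numbers and `0.0` for videos without views are assumptions rather than documented behavior.

```python
from datetime import datetime

def parse_published_date(value: str) -> datetime:
    """Drop the trailing 'Z' and parse the ISO 8601 timestamp."""
    return datetime.fromisoformat(value.rstrip("Z"))

def to_int(value, default=None):
    """Convert strings such as '1200' to int; return default on bad input."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default

def format_duration(seconds: int) -> str:
    """Turn a number of seconds into a human-readable H:MM:SS string."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"

def normalize_title(title: str) -> str:
    """Strip leading/trailing whitespace and collapse internal runs."""
    return " ".join(title.split())

def engagement_rate(likes: int, comments: int, views: int) -> float:
    """(likes + comments) / views * 100, or 0.0 when there are no views."""
    if views <= 0:
        return 0.0  # assumption: avoid division by zero for unviewed videos
    return (likes + comments) / views * 100
```

For example, `format_duration(3725)` gives `"1:02:05"`, and a video with 45 likes, 5 comments, and 1,000 views has an engagement rate of 5%.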
Step 3: Solution for the problem of Irrelevant Videos
We found that our data still contained some videos that were not at all related to plant breeding concepts, even after the data cleaning and preprocessing described above. This was a major problem, and it led us to look for solutions beyond the original scope of the project. We therefore removed the irrelevant videos from our data using the following libraries:
- nltk: For natural language processing (e.g., tokenizing words in text).
- sentence-transformers: For encoding sentences into vector representations using BERT.
- sklearn: For calculating cosine similarity between vectors.
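To make the cosine similarity metric concrete, here is a pure-Python sketch of what it computes. It is not the sklearn implementation the project uses, and the short toy vectors in the usage note merely stand in for the much longer sentence embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors (an assumption)
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction, such as `[1, 2, 3]` and `[2, 4, 6]`, score close to 1, while orthogonal vectors such as `[1, 0]` and `[0, 1]` score 0.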
Inspired by these sources:
- https://www.restack.io/p/similarity-search-answer-relevance-scores-cat-ai?utm_source=chatgpt.com
- https://www.33rdsquare.com/four-of-the-easiest-and-most-effective-methods-of-keyword-extraction-from-a-single-text-using-python/?utm_source=chatgpt.com
- https://www.analyticsvidhya.com/blog/2022/08/movies-recommendation-system-using-python/?utm_source=chatgpt.com
The goal is to assess how relevant each video is to a given keyword and to filter out irrelevant videos. The solution uses both basic text matching and semantic similarity (cosine similarity) to calculate a “relevance score” for each video.
- Relevance Scoring:
- Each video title is compared to its corresponding search keyword using:
- Direct Keyword Match: Checks if the keyword is present in the title.
- Word Overlap: Measures how many words in the title overlap with the keyword.
- Sequence Similarity: Checks how similar the sequence of words in the title is to the keyword.
- These scores are combined with weights to calculate a base relevance score:
- Direct keyword match (30% weight)
- Word overlap (40% weight)
- Sequence similarity (30% weight)
- Semantic Similarity:
- Uses Sentence BERT (SBERT) to compute the similarity between the title and keyword on a semantic level.
- The final relevance score combines the base relevance score (60%) and semantic similarity score (40%).
- Categorization:
- Videos are categorized based on their relevance scores:
- High: Scores ≥ 40 (as the average relevance score of the data was near 40)
- Medium: Scores between 20 and 40 (half of average relevance score)
- Low: Scores between 10 and 20 (quarter of average relevance score)
- Not Relevant: Scores < 10 (in our data, no video below this score was relevant to its keyword)
- Filter and Save:
- Videos with relevance scores > 10 are considered relevant and saved to a new file.
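The scoring scheme described above can be sketched as follows. Several details are assumptions made for illustration: scores are on a 0 to 100 scale, word overlap is normalized by the number of keyword words, sequence similarity uses `difflib.SequenceMatcher` over word lists, and the semantic score is passed in as a number instead of being computed with SBERT. A score of exactly 10 is treated as Low here.

```python
from difflib import SequenceMatcher

def base_relevance(title: str, keyword: str) -> float:
    """Weighted lexical score (0-100): 30% match, 40% overlap, 30% sequence."""
    title_l, keyword_l = title.lower(), keyword.lower()
    title_tokens, keyword_tokens = title_l.split(), keyword_l.split()
    keyword_set = set(keyword_tokens)

    # Direct keyword match: is the whole keyword phrase inside the title?
    direct = 1.0 if keyword_l in title_l else 0.0
    # Word overlap: share of keyword words that also occur in the title.
    overlap = len(set(title_tokens) & keyword_set) / len(keyword_set) if keyword_set else 0.0
    # Sequence similarity: how alike the two word sequences are.
    sequence = SequenceMatcher(None, title_tokens, keyword_tokens).ratio()

    return 100 * (0.30 * direct + 0.40 * overlap + 0.30 * sequence)

def final_relevance(base: float, semantic: float) -> float:
    """Blend the base score (60%) with a 0-100 semantic score (40%)."""
    return 0.60 * base + 0.40 * semantic

def categorize(score: float) -> str:
    """Map a final relevance score to the categories listed above."""
    if score >= 40:
        return "High"
    if score >= 20:
        return "Medium"
    if score >= 10:
        return "Low"
    return "Not Relevant"
```

A title that contains the full keyword phrase scores near the top of the scale, while a title sharing no words with the keyword gets a base score of 0 and can only be rescued by a high semantic score.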
Step 4: Exploratory Data Analysis
The following libraries are used for data analysis:
- pandas: For data manipulation and analysis
- numpy: For numerical operations
- matplotlib: For creating static, interactive, and animated visualizations
- seaborn: For creating statistical data visualizations
- plotly: For creating interactive web-based visualizations
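A category-level summary of the kind reported in the findings could be computed along these lines. This standard-library sketch stands in for the pandas group-by the project presumably uses; the function name, the row layout, and the choice to average per-video engagement rates are assumptions, and the rows are invented purely to exercise the code.

```python
from collections import defaultdict

def summarize_by_category(rows):
    """Return total views and mean engagement rate for each category."""
    views = defaultdict(int)
    rates = defaultdict(list)
    for row in rows:
        views[row["category"]] += row["view_count"]
        rates[row["category"]].append(row["engagement_rate"])
    return {
        cat: {
            "total_views": views[cat],
            "mean_engagement_rate": sum(rates[cat]) / len(rates[cat]),
        }
        for cat in views
    }

# Toy rows, not taken from the real dataset.
rows = [
    {"category": "Old", "view_count": 100, "engagement_rate": 2.0},
    {"category": "Old", "view_count": 300, "engagement_rate": 4.0},
    {"category": "Modern", "view_count": 500, "engagement_rate": 6.0},
]
summary = summarize_by_category(rows)
```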
Key Findings
Demand Indicators:
- Top Keywords:
- Keywords like Modern Breeding Techniques, Hybrid Varieties, and CRISPR in Plants have the highest Opportunity Scores, indicating high demand and strong engagement potential.
- Engagement Metrics:
- Keywords under the Modern category show the highest average Engagement Rate (5.6%), while Old technologies demonstrate steady, albeit lower, audience interest.
Category-Level Insights:
- Modern:
- This category has the highest Opportunity Score due to topics like genetic editing and sustainable practices gaining popularity.
- Total Views: 3.2M; Engagement Rate: 5.6%
- Current:
- Focused on trending topics like genome sequencing, offering moderate engagement and competitive opportunities.
- Total Views: 2.8M; Engagement Rate: 4.3%
- Old:
- Historical breeding methods still attract niche audiences, but lower engagement suggests less dynamic interest.
- Total Views: 1.9M; Engagement Rate: 3.2%
Visualization Insights:
- Treemap Analysis:
- Keywords under the Modern category dominate with higher Opportunity Scores.
- Bubble Chart:
- Keywords with high views and engagement (e.g., CRISPR, Sustainable Practices) present ideal starting points for content creation.
Relevance of Such Analysis
- Identifying Content Gaps:
- The analysis reveals specific keywords with high demand but relatively low supply in terms of detailed content. Targeting these gaps can attract substantial viewership.
- Strategic Content Focus:
- Keywords with high Engagement Rates and Opportunity Scores (e.g., Modern Technologies) indicate topics where audiences are actively interested and engaged.
- Avoid oversaturated or underperforming keywords.
- Audience Preferences:
- Insights on engagement trends guide the style and depth of content creation (e.g., informative, visual-heavy, or interview formats).
Actionable Recommendations
- Focus Areas:
- Begin with Modern topics like CRISPR and Hybrid Breeding, as they combine high audience interest and engagement.
- Content Strategy:
- Use engaging visuals and simplified explanations for complex topics to increase accessibility.
- Consider a series format to explore different aspects of modern plant breeding technologies.
- Niche Selection:
- Target subdomains like Genetic Editing, Genome Sequencing, and Sustainable Breeding Practices, which show consistent demand.
- Content Frequency:
- Publish frequently on trending topics to maintain relevance, especially under the Current category.
- SEO Optimization:
- Optimize video titles and descriptions with high-ranking keywords like CRISPR, Hybrid Plants, and Genome Editing.
Future Work
- Trend Monitoring:
- Continuously analyze evolving audience interests using new data over time to stay ahead of trends.
- Deep Dive into Audience Segmentation:
- Identify demographic and regional preferences for specific keywords to tailor content.
- Performance Tracking:
- Regularly monitor engagement metrics (views, likes, comments) to refine the content strategy.
- Collaborative Content:
- Collaborate with experts or influencers in the plant breeding domain to leverage their audience and add credibility.
- Advanced Analytics:
- Utilize predictive modeling to forecast future trends in plant breeding content on YouTube.
Project Setup
1. Clone the repository:
git clone <repository-url>
cd <repository-name>
2. Install required packages:
pip install -r requirements.txt
3. Run the application:
streamlit run dashboard.py
To run a fresh analysis, set up the API configuration:
- Get a YouTube Data API key from the Google Cloud Console.
- Set up environment variables:
- Copy .env.example to .env:
cp .env.example .env
- Edit .env and add your YouTube API key:
YOUTUBE_API_KEY=your_actual_api_key_here
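Once the key is in place, the script needs to read it at startup. The sketch below assumes the values from .env have already been loaded into the process environment (for example by a dotenv loader); the function name `get_api_key` is hypothetical, and only the variable name `YOUTUBE_API_KEY` comes from the setup steps.

```python
import os

def get_api_key(env=os.environ) -> str:
    """Return YOUTUBE_API_KEY, or raise a clear error if it is not set."""
    key = env.get("YOUTUBE_API_KEY", "").strip()
    # Reject both a missing key and the untouched placeholder value.
    if not key or key == "your_actual_api_key_here":
        raise RuntimeError("YOUTUBE_API_KEY is not set; add it to your .env file")
    return key
```

Failing early with an explicit message is usually easier to debug than an authentication error returned by the first API request.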
Project Structure
├── README.md
├── requirements.txt
├── .env.example
├── .env                      # (git-ignored)
├── .gitignore
├── youtube_data_fetcher.py
├── dashboard.py
└── youtube_data/
    └── all_videos_data.csv