Add functionality to fetch and save PubMed records by year
This commit extends the PubMedDownloader class to fetch PubMed records by topic and year through the NCBI E-utilities API. The records fetched for each year are saved to a dedicated text file in MEDLINE format, which keeps the data organized and easy to access.

Key Changes:
- Added file saving functionality that organizes records into files named by topic and year.
- Implemented error handling with retry logic and exponential backoff to manage network and API errors more robustly.
- Configured the fetch function to retrieve records in MEDLINE format, ensuring that the data is structured according to PubMed's bibliographic standards.
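
The retry logic with exponential backoff mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the helper name `retry_with_backoff` and its parameters (`max_attempts`, `base_delay`, `retry_on`, `sleep`) are hypothetical, and the actual script may choose different delays or catch narrower exception types.

```python
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func() and retry on failure, doubling the wait each time.

    Hypothetical helper for illustration only. `sleep` is injectable so
    the waiting behaviour can be checked without real delays.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == max_attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

A network call would be wrapped as `retry_with_backoff(lambda: fetch_year(topic, year))`, where `fetch_year` stands in for whatever function performs the request. Passing a narrower `retry_on` tuple (for example `(OSError,)`) avoids retrying on programming errors.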

The records are stored in the './results/baseline_doc' directory, with each file holding one year's records for the chosen topic. This gives researchers structured, per-year bibliographic data from PubMed.
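
As a rough sketch of the per-year file layout described above, assuming the fetched MEDLINE text is already available as a string: the function name `save_records` and the `<topic>.<year>.txt` naming pattern are assumptions made for illustration, since the commit message only says that files are named by topic and year.

```python
from pathlib import Path


def save_records(records, topic, year, out_dir="./results/baseline_doc"):
    """Write one year's MEDLINE-formatted text to its own file.

    Hypothetical helper: the file naming pattern here is an assumption,
    not taken from the committed script.
    """
    directory = Path(out_dir)
    # Create ./results/baseline_doc (and parents) if it does not exist yet.
    directory.mkdir(parents=True, exist_ok=True)
    # Replace spaces so the topic is safe to use in a file name.
    safe_topic = topic.strip().replace(" ", "_")
    path = directory / f"{safe_topic}.{year}.txt"
    path.write_text(records, encoding="utf-8")
    return path
```

Calling `save_records(medline_text, "heart failure", 2020)` would write to `results/baseline_doc/heart_failure.2020.txt` under the current working directory.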
lrm22005 committed May 6, 2024
1 parent 77d0ba9 commit fe91786
Showing 1 changed file with 24 additions and 0 deletions.
24 changes: 24 additions & 0 deletions code/step_1_data_collection_Luis.py
@@ -1,3 +1,27 @@
"""
Code created by: lrmercadod
Date: 5/6/2024 10:43:45
PubMed Record Fetcher and Saver
This script is designed to automate the retrieval of PubMed records based on a specific topic and year. It uses the NCBI E-utilities API to fetch data in MEDLINE format and saves each year's data in a separate text file within a structured directory.
Features:
- Fetches PubMed records using a combination of the topic and year to form a query.
- Retrieves data in MEDLINE format, which includes structured bibliographic information.
- Saves the fetched data into text files, organizing them by topic and year under the './results/baseline_doc' directory.
- Handles network and API request errors by implementing retry logic with exponential backoff.
Usage:
- The user must provide an NCBI API key and email for using NCBI's E-utilities.
- Modify the 'topic' variable and the year range in the script to fetch records for different topics or years.
Dependencies:
- BioPython for interacting with NCBI's E-utilities.
- requests for making HTTP requests.
Example:
To use the script, simply run it in a Python environment with the necessary dependencies installed. Ensure that the API key and email are correctly set up in the script.
"""
import requests
from Bio import Entrez
from io import StringIO