bugfix: extract numbered topic items within interest section#19

Open
RonghaiHe wants to merge 1 commit into AI45Lab:main from RonghaiHe:bugfix/queries_extraction

Conversation

@RonghaiHe

Similar to an earlier pull request: this change extracts the numbered topic items within the interest section and uses them as Semantic Scholar queries. The difference is that this version is based on the latest commit, 9aa7efb.

Tested via:

python main.py --sources semanticscholar --save --skip_source_emails

…queries_from_description

Co-Authored-By: OpenCode <opencode-agent[bot]@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 5, 2026 09:56

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates the Semantic Scholar source query derivation to parse “interest” sections and numbered topic lists from a free-form research description.

Changes:

  • Reworks _derive_queries_from_description to use a section-based state machine (in_interest_section) rather than per-line prefix stripping.
  • Extracts topics primarily from numbered items inside an “interest/关注/研究” section (关注 and 研究 roughly mean “follow” and “research”), with additional cleanup around common separators.
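
The section-based approach described above can be sketched as a standalone function. This is a minimal illustration assembled from the fragments quoted in this review, not the PR's actual method: the name `derive_queries`, the module-level constants, and the ordering of the checks (negation before header detection) are assumptions, and the "cleanup around common separators" step is omitted because its details are not shown here.

```python
import re

# Assumed constants; the marker strings and the regex come from the quoted diff fragments.
NEGATION_MARKERS = ("not interested", "不感兴趣", "don't", "exclude")
NUMBERED_ITEM = re.compile(r'^\d+[\.\)\-:、]\s*(.*)')


def derive_queries(desc: str) -> list[str]:
    """Collect numbered topic items that appear inside an interest section."""
    queries: list[str] = []
    in_interest_section = False

    for raw in desc.split("\n"):
        line = raw.strip()
        if not line:
            continue
        lower = line.lower()

        # Assumption: negation is checked first, so "not interested"
        # closes the section instead of re-opening it.
        if any(neg in lower for neg in NEGATION_MARKERS):
            in_interest_section = False
            continue

        # A line mentioning "interest" (or 关注 / 研究) opens the section;
        # the header line itself is not a topic.
        if re.search(r'\binterest', lower) or "关注" in lower or "研究" in lower:
            in_interest_section = True
            continue

        # Inside the section, only numbered items become queries.
        if in_interest_section:
            m = NUMBERED_ITEM.match(line)
            if m:
                content = m.group(1).strip()
                if content and len(content) > 1:
                    queries.append(content[:120])

    # Fallback query taken from the quoted diff.
    return queries or ["artificial intelligence"]
```

One consequence of this ordering, at least in the sketch: a numbered item whose text itself contains "interest", 关注, or 研究 is treated as a section header and skipped rather than collected as a query.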

Comment on lines +64 to +66
in_interest_section = False

for line in desc.split("\n"):
Comment on lines +77 to +80
# Lines containing "interest" signal the start of the interest section
if re.search(r'\binterest', lower) or "关注" in lower or "研究" in lower:
    in_interest_section = True
    continue  # header line itself is not a topic
Comment on lines +82 to +86
# Within the interest section, extract from numbered items
if in_interest_section:
    m = re.match(r'^\d+[\.\)\-:、]\s*(.*)', line)
    if m:
        content = m.group(1)
Comment on lines +91 to 94
if content and len(content) > 1:
    queries.append(content[:120])

return queries or ["artificial intelligence"]
def _derive_queries_from_description(self) -> list[str]:
"""Extract up to 3 search queries from the user description."""
"""Extract search queries from the user's research description."""
import re
lower = line.lower()

# "not interested" signals the end of the interest section
if any(neg in lower for neg in ("not interested", "不感兴趣", "don't", "exclude")):