HydraHead could cut attention FLOPs by roughly 40% by mixing attention types within layers


Researchers have developed HydraHead, a technique that blends full attention and linear attention at the level of individual heads within each transformer layer, rather than swapping entire layers. The method reserves costly quadratic full-attention computation for just 25% of heads (a 3:1 linear-to-full ratio), while the remaining 75% use a cheaper linear-attention module called GDN. Despite this aggressive reduction, HydraHead matches the benchmark performance of conventional 3:1 layer-wise hybrid models, and it does so even at linear-to-full head ratios as high as 7:1, where only one head in eight uses full attention. The approach was evaluated on long-context reading and reasoning tasks after training on 15 billion tokens, and could reduce attention-related FLOPs by roughly 40%. If the gains hold broadly, the technique could enable larger context windows or allow bigger models to run on lower-end hardware.
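To make the head ratios concrete, here is a minimal arithmetic sketch of how a linear-to-full ratio translates into head counts. The head count of 32 and the function name `split_heads` are assumptions chosen for illustration; they do not come from the HydraHead work, and the sketch says nothing about how the heads are actually implemented or trained.

```python
def split_heads(num_heads: int, linear: int, full: int) -> tuple[int, int]:
    """Return (full_heads, linear_heads) for a linear:full head ratio."""
    parts = linear + full
    if num_heads % parts != 0:
        raise ValueError("num_heads must be divisible by linear + full")
    full_heads = num_heads * full // parts
    return full_heads, num_heads - full_heads


NUM_HEADS = 32  # hypothetical head count, not taken from the source

for linear, full in [(3, 1), (7, 1)]:
    n_full, n_linear = split_heads(NUM_HEADS, linear, full)
    share = n_full / NUM_HEADS  # fraction of heads paying the quadratic cost
    print(f"{linear}:{full} -> {n_full} full, {n_linear} linear ({share:.1%} quadratic)")
```

With these assumed numbers, a 3:1 ratio leaves 8 of 32 heads on full attention and a 7:1 ratio leaves 4.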

Read the full story at DEV Community

This is an AI-generated summary. ShortSingh links to the original source for the complete article.


System Design Interviews: Why Framework Matters More Than Tool Knowledge

A tutorial published on DEV Community uses a fictional uncle-nephew dialogue to explain why many experienced engineers struggle with system design interview questions. The core argument is that candidates who fail usually have a framework problem rather than a knowledge gap: they jump to technology choices before understanding the problem. The guide proposes a 12-step methodology grouped into three phases (Understand, Design, and Robustify), meant to be applied consistently across any system design prompt. The approach emphasizes asking clarifying questions first, estimating scale before selecting tools, and separating functional from non-functional requirements. The author contends that mastering this fixed sequence allows engineers to tackle any 'design X' question with structure and confidence.
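The "estimate scale before selecting tools" step can be sketched as a back-of-envelope calculation. The function `estimate_load`, the peak factor, and every input number below are hypothetical and are not taken from the tutorial; the point is only that rough request rates and storage growth can be computed before any technology is named.

```python
SECONDS_PER_DAY = 86_400


def estimate_load(daily_users, requests_per_user, bytes_per_request, peak_factor=2.0):
    """Rough load estimate from a few assumed inputs."""
    daily_requests = daily_users * requests_per_user
    avg_qps = daily_requests / SECONDS_PER_DAY
    return {
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * peak_factor,  # assumed peak-to-average multiplier
        "daily_storage_gb": daily_requests * bytes_per_request / 1e9,
    }


# Hypothetical workload: 1M daily users, 20 requests each, 500 bytes stored per request.
est = estimate_load(daily_users=1_000_000, requests_per_user=20, bytes_per_request=500)
for name, value in est.items():
    print(f"{name}: {value:,.1f}")
```

Under these assumed inputs the system averages a few hundred requests per second and stores about 10 GB per day, which already rules some design options in or out.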

Read more at DEV Community


Dev builds browser-only toolkit after accidentally exposing production credentials online

A developer built a suite of privacy-focused tools after realising they had unwittingly sent production database credentials to a third-party server through an online .env converter. Investigating other commonly used tools revealed a similar pattern: thin frontends masking backend processing, with no transparency about data retention. The resulting toolkit, available at configdev.com, includes an env converter, crontab-to-systemd converter, CIDR calculator, PII log scrubber, and CSV-to-JSON Schema builder. All processing runs entirely in the browser, meaning no data is transmitted to external servers, which users can verify by checking the network tab or going offline mid-session. The project is in its early stages with few users so far, but the developer has made it publicly available and is open to feedback.
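To illustrate why a .env conversion needs no server round trip, here is a minimal local sketch. It is written in Python rather than browser code and is not the toolkit's implementation; the function `parse_env`, its simplified parsing rules, and the sample values are all assumptions made for illustration.

```python
import json


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; real .env dialects support more syntax."""
    result = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")  # split on the first '=' only
        if not sep:
            continue  # not a KEY=VALUE line
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]  # drop matching surrounding quotes
        result[key.strip()] = value
    return result


# Hypothetical sample input; nothing here leaves the local process.
sample = """
# database settings
DB_HOST=localhost
export DB_PORT=5432
DB_PASSWORD="s3cret=value"
"""
print(json.dumps(parse_env(sample), indent=2))
```

The whole conversion is string handling on data already in memory, which is the property the browser-only toolkit relies on.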

Read more at DEV Community


AI Chatbots Are Sending Paying Customers to Businesses That Can't Track Them

An Israeli AI-automation agency, Automaziot AI, discovered that AI assistants like ChatGPT and Perplexity had been quietly referring customers to their business since mid-May 2026, generating at least nine tracked web leads plus additional phone inquiries. Two of those leads converted into paying clients worth a combined ₪35,000 (roughly $9,300), yet the company's CRM had misclassified nearly all of them as 'website' or 'unknown' traffic. A key example involved a window-cleaning business owner who phoned the agency after an AI assistant recommended them, closed a deal, and paid, all within the same day, leaving no digital attribution trail. Standard CRM attribution systems, built around paid-click identifiers and form submissions, are structurally unable to capture referrals that originate from AI assistants, especially when the next step is a phone call. The agency found that AI-referred leads also showed the highest inbound-to-outbound message engagement ratio of any acquisition source in their CRM, suggesting meaningful buyer intent rather than casual browsing.
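For web leads only, one way to avoid the 'website' or 'unknown' bucket is to inspect the referrer before the lead is stored. The sketch below is an assumption-heavy illustration, not the agency's method: the host list, the function `classify_lead`, and the label names are hypothetical, and nothing like this can recover a referral that arrives as a phone call.

```python
from typing import Optional
from urllib.parse import urlparse

# Assumed, illustrative list of AI-assistant referrer hosts; not from the source.
AI_REFERRER_HOSTS = {"chatgpt.com", "perplexity.ai"}


def classify_lead(referrer: Optional[str]) -> str:
    """Label a web lead by its referrer URL, if one was captured."""
    if not referrer:
        return "unknown"
    host = (urlparse(referrer).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in AI_REFERRER_HOSTS:
        return "ai_assistant"
    return "website"


for ref in ["https://chatgpt.com/", "https://www.perplexity.ai/search?q=x", None]:
    print(ref, "->", classify_lead(ref))
```

Browsers and apps do not always send a referrer, so a check like this would undercount AI referrals even for web traffic.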

Read more at DEV Community


Linux DMA Mapping API: How Coherent and Streaming Mappings Differ

The Linux DMA mapping API helps driver authors translate CPU buffer addresses into bus addresses usable by devices, while also handling cache maintenance on non-coherent architectures. Two primary mapping types exist: coherent mappings (allocated with dma_alloc_coherent()) for small, long-lived control structures that require no explicit syncing, and streaming mappings (created with dma_map_single() and related calls) for bulk data transfers, which must be explicitly synced with the dma_sync_*() helpers if the CPU accesses the buffer while it is still mapped. A key challenge is that DMA spans three distinct address spaces (kernel virtual, CPU physical, and device bus addresses), which are not interchangeable. On non-coherent embedded SoCs, incorrect cache handling can cause data corruption that shows up on ARM targets but not on x86 systems, where DMA is typically cache-coherent, making such bugs notoriously difficult to diagnose. The API abstracts these architecture-specific cache operations, and tools like CONFIG_DMA_API_DEBUG can help validate correct usage of map and unmap calls in driver code.

Read more at DEV Community