The Squatting Pirates

Wulf's corner for hosting an LLM!

Linux · 2 Posts · 1 Poster
    wulf7685 wrote (#1):

    Step 1: acquire hardware...

    I have made a list of hardware I would like to use. I'm going to try to stay on track with it, but things will likely change due to price fluctuations and what turns up on local markets (FB Marketplace, etc.). I'm going to try to host Qwen3.5-35B_A3B on Nobara.

    Qwen3.5-35B_A3B specs, per Grok:

    Total parameters: 35 billion.
    Activated parameters per token: ~3 billion (hence the "A3B" suffix). This comes from a sparse MoE setup with 256 experts total, of which only 8 routed + 1 shared expert activate per forward pass (roughly 3.5% activation rate).
    Layers: 40 total.
    Hidden dimension: 2048.
    Vocabulary size: 248,320 (padded).
    Hybrid layout (repeated 10 times): 3 × (Gated DeltaNet → MoE) followed by 1 × (Gated Attention → MoE). This mixes linear attention (for efficiency) with occasional full softmax attention.
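    The ~3.5% activation figure is easy to sanity-check from the expert counts quoted above. A quick sketch (numbers taken from the spec list, not independently verified):

```python
# Sanity-check the quoted ~3.5% expert activation rate.
# Numbers come from the spec list above: 256 experts total,
# 8 routed + 1 shared expert active per forward pass.
total_experts = 256
active_experts = 8 + 1  # 8 routed + 1 shared

rate = active_experts / total_experts
print(f"activation rate: {rate:.2%}")  # 3.52%
```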

    Gated DeltaNet details (linear attention component):
    32 heads for V, 16 for QK.
    Head dimension: 128.

    Gated Attention details (the "full" attention layers):
    16 heads for Q, 2 for KV (grouped-query style for efficiency).
    Head dimension: 256.
    Rotary Position Embedding (RoPE) dimension: 64.

    MoE details:
    Expert intermediate dimension: 512.
    This hybrid design (Gated DeltaNet for most layers + sparse MoE FFN blocks) enables high throughput with low latency and minimal KV-cache overhead compared to pure Transformer models.
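    To make "only 8 routed experts activate per token" concrete, here is a toy top-k routing sketch in plain Python. It is illustrative only, not the model's actual routing code (which also adds the shared expert and load-balancing terms):

```python
import math

def route_topk(logits, k=8):
    """Pick the k highest-scoring experts and softmax-normalize
    their gate weights -- a minimal sketch of sparse MoE routing."""
    topk = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i] - logits[topk[0]]) for i in topk]  # stable softmax
    total = sum(exps)
    return {i: e / total for i, e in zip(topk, exps)}

# One gate logit per expert; only the top 8 get any weight at all.
logits = [math.sin(i * 0.7) for i in range(256)]
weights = route_topk(logits, k=8)
print(len(weights))  # 8 experts active, the other 248 contribute nothing
```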

    huggingface.co

    The result is a model that behaves like a much denser/larger one during inference but only "pays" for ~3B active parameters. It often outperforms the prior flagship Qwen3-235B-A22B (dense or MoE variants) on many tasks while being dramatically cheaper and faster to run.

    Multimodal Capabilities

    It is a unified vision-language model trained with early fusion on multimodal tokens. It natively supports:
    Text.
    Images.
    Video.

    Strengths include:
    Visual reasoning.
    Spatial grounding.
    Document/UI analysis.
    OCR.
    Image captioning.
    GUI interaction.

    It outperforms the previous Qwen3-VL series in these areas and achieves cross-generational parity (or better) with larger Qwen3 models on multimodal benchmarks. It handles tool use and agentic workflows that combine vision + reasoning + code.

    lmstudio.ai

    Context and Efficiency

    Native context length: 262,144 tokens (~393 A4 pages of text).
    Extendable to ~1 million tokens via RoPE/YARN scaling.
    Strong context scaling: throughput degrades very little even at long contexts (e.g., only ~0.9% drop in some tests from 512 to 8k+ tokens).
    Supports up to ~65k output tokens in many deployments.
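    One way to see why the long-context overhead is so low: only the 10 gated-attention layers keep a growing softmax KV cache (the DeltaNet layers carry constant-size recurrent state), and grouped-query attention means just 2 KV heads per layer. A back-of-the-envelope FP16 estimate from the specs above (my arithmetic, not an official figure):

```python
# Rough FP16 KV-cache size at the full 262,144-token context,
# using the specs quoted earlier. Only the gated-attention layers
# (1 of every 4, so 10 of 40) keep a softmax KV cache.
full_attn_layers = 40 // 4
kv_heads = 2          # grouped-query KV heads
head_dim = 256
seq_len = 262_144
bytes_per_value = 2   # FP16

kv_bytes = 2 * full_attn_layers * kv_heads * head_dim * seq_len * bytes_per_value
print(f"KV cache ~ {kv_bytes / 2**30:.1f} GiB")  # ~ 5.0 GiB
```

    Around 5 GiB of cache for a quarter-million-token context is tiny compared to a pure Transformer of similar size, where all 40 layers would pay that cost.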

    Inference speed is a major selling point. On consumer hardware (e.g., quantized Q4/Q5 or FP8):
    RTX 3090 (~$800 used): ~110–112 tokens/second at full 262k context.
    RTX 4090: ~122 tokens/second.
    RTX 5090: up to ~170 tokens/second.
    Apple Silicon (M1/M4 Macs, 24–64GB unified memory): 15–37+ tokens/second depending on quantization and prompt length.
    High-end servers (H100/H200): hundreds to over 1,000 tokens/second in batched/optimized setups.

    Quantized versions (e.g., 4-bit, FP8, GGUF) fit comfortably on 24GB VRAM cards or even lower with offloading. Some reports show it running on as little as 8–19GB VRAM effectively.
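    Those VRAM reports line up with simple bits-per-weight arithmetic. A rough sketch (real quants add metadata overhead, and you still need headroom for KV cache and activations):

```python
def weight_gib(params_billion, bits_per_weight):
    """Approximate weight footprint in GiB, ignoring quant metadata."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

for name, bits in [("FP16", 16), ("FP8", 8), ("Q4 (~4.5 bpw)", 4.5)]:
    print(f"{name:14s} ~{weight_gib(35, bits):5.1f} GiB")
```

    At ~4.5 bits per weight the weights alone come in around 18 GiB, which is consistent with the low-VRAM reports once partial CPU offloading is used.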

    agentnativedev.medium.com

    It is designed for production: low latency, high throughput, and suitable for local/agentic use cases (many users report it replacing multi-model setups for coding + reasoning workflows).

    Performance Highlights

    The model shines in efficiency-vs-performance tradeoffs. It delivers:
    Reasoning and knowledge: Competitive with or better than much larger models. Examples from benchmarks:
    MMLU-Pro: ~85.3%.
    GPQA Diamond: ~84.2%.
    Strong on math (e.g., HMMT), multilingual (C-Eval), and general knowledge.

    Coding and agentic tasks: Excellent at software engineering, tool use, and multi-step agentic workflows (planning + coding + analysis). Users frequently call it a "gamechanger" for local coding agents, often handling full tasks (data analysis + code + insights) in a single model where previously two were needed.
    Multimodal: Improved over prior VL models in document understanding, visual reasoning, and GUI tasks.
    Overall intelligence indices place it high among open models of its class, sometimes approaching or rivaling closed models like Claude Sonnet variants or GPT-5-mini in specific domains (reasoning, coding, agents).

    It supports 201 languages and dialects with nuanced cultural understanding. Post-training includes scaled reinforcement learning (RL) across complex, progressively harder tasks for better real-world generalization.

    Other Practical Details

    License: Apache 2.0 (fully open weights; base and instruction-tuned variants available on Hugging Face).
    Deployment: Works with Hugging Face Transformers, vLLM, llama.cpp, Ollama, LM Studio, MLX (Apple), etc. FP8 and various GGUF quants are popular.
    Hosted versions: Powers things like Qwen3.5-Flash on Alibaba Cloud (fast/cheap tier).
    Uncensored variants: Community fine-tunes exist for minimal refusals.
    Limitations (as with any model): Still has occasional hallucinations on edge cases; multimodal performance depends on input quality; very long contexts can still tax lower-end hardware in prefill phase.
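    For the llama.cpp route, a launch would look something like the sketch below. The GGUF filename is a placeholder for whichever quant actually gets downloaded; the flags themselves are standard llama-server options:

```shell
# Hypothetical llama.cpp launch -- filename is a placeholder.
# -ngl 99 offloads all layers to the GPU if they fit;
# shrink --ctx-size if VRAM runs out during prefill.
llama-server -m Qwen3.5-35B-A3B-Q4_K_M.gguf \
    --ctx-size 32768 \
    -ngl 99 \
    --port 8080
```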

    In short, Qwen3.5-35B-A3B is one of the best examples yet of how architectural innovations (hybrid linear attention + sparse MoE + high-quality training/RL) can break traditional scaling expectations. It gives frontier-like capabilities (reasoning, coding, vision, agents) at the inference cost of a much smaller dense model, making it highly practical for local runs, agentic systems, and cost-sensitive production use. Many in the community view it as a turning point for accessible high-intelligence local AI.

    wulf7685 wrote (#2):
    This post is deleted!
