
Export only users participating or part of exported channels or conversations #287

Open · cdeszaq opened this issue May 6, 2024 · 7 comments


cdeszaq commented May 6, 2024

Is your feature request related to a problem? Please describe.

My Slack workspace has many more users than I care about, since I only need a subset of conversations. Exporting all of them is a large waste of time and resources.

Describe the solution you'd like

I would like an option to build up the user information cache as the desired channels and conversations are exported or dumped, rather than either not caching the information or not exporting the user information at all.

Describe alternatives you've considered

  1. Simply not having user information - Just user IDs have a certain "obfuscation" appeal, but this greatly reduces the usability of the export
  2. Manually maintaining the user info file - Not appealing, because there are still many users I do care about, and they get added over time, so manual maintenance would be burdensome
  3. A way to update an existing user cache, rather than replace it - This would probably work, though a large enough set of users could still make it burdensome

Additional context

None

rusq (Owner) commented May 6, 2024

Hey @cdeszaq, if you can be bothered, try building v3 from the master branch; it has some major improvements. There's no documentation yet (you can try reading the man page: in the root of the project, type `man ./slackdump.1`), or refer to this comment which I left in another issue: #273 (comment)

The v3 branch was merged into master, so ignore the comment saying "checkout the v3 branch".


cdeszaq commented May 6, 2024

Ahh, I see! I'll happily play with v3 and report back. I'm currently doing so with a checkout of master, in fact, with this command:

```
./slackdump export -enterprise -cache-dir /Users/cdeszaq/playground/slackdump/cache -o /Users/cdeszaq/playground/slackdump/exportTest4 -v <chanID1> <chanID2>
```

(The cache directory doesn't seem to be getting used, but that's not so much "functionality" related, so I don't mind.)

But after downloading the channel messages and files, it seems to be trying to also download all the users. I haven't dug into the man pages yet, nor closely through the linked issue, but I'll try those things next.

Otherwise, any pointers to what I'm missing or doing wrong in my incantation above would help as well I'm sure.


cdeszaq commented May 6, 2024

From the head of master, with a vanilla archive command limited to 2 channels for an enterprise Slack, I'm still seeing what looks like "download all the users" behavior. The program appears to hang, judging by the terminal output: in non-verbose mode, I stop getting output after it has finished downloading the threads. In verbose mode (with the -v flag), I at least see a regular march of output showing that something is happening.

Command:

```
./slackdump archive -enterprise <chanID1> <chanID2>
```

`-v` trailing output:

```
archive: 2024/05/06 17:00:04.897276 network.go:136: success
archive: 2024/05/06 17:00:04.912997 network.go:136: maxAttempts=20
archive: 2024/05/06 17:00:06.438011 network.go:136: success
archive: 2024/05/06 17:00:06.452961 network.go:136: maxAttempts=20
archive: 2024/05/06 17:00:07.920001 network.go:136: success
archive: 2024/05/06 17:00:07.926486 network.go:136: maxAttempts=20
archive: 2024/05/06 17:00:08.971212 network.go:136: WithRetry: slack rate limit exceeded, retry after 30s (*slack.RateLimitedError) after 1 attempts
archive: 2024/05/06 17:00:08.971262 network.go:136: got rate limited, sleeping 30s
archive: 2024/05/06 17:00:39.665184 network.go:136: success
```

That said, I do like the default output directory name (simply stamping the directory name with the execution date/time saves me a step I would inevitably do!)

rusq (Owner) commented May 6, 2024

That's great! I just checked, and unfortunately you can't skip downloading all the users. This is something I'll look into; I can't promise I'll get to it in the near future due to life circumstances, but it will be on my list when I get back to it. I hope v3 is better suited to what you're trying to do, and that you find it more pleasant to use than v2. It would be great if you could report any issues.

Regarding the rate limit: there have been instances where Slack would severely restrict the user endpoint. You can see the exact endpoint that was limited if you enable tracing (`./slackdump archive -enterprise -v -trace=trace.out <chanID1> ...`) and then run `go tool trace trace.out` on the output; it will be in the User defined tasks and User defined regions. Luckily, it seems the retry recovery worked?


cdeszaq commented May 6, 2024

Retry recovery seems to work, as it is cranking through the users regardless of being rate-limited.

The newer version indeed seems much nicer. I've yet to try the `-member-only` option, which is another key desire I have. I'm not sure if it will work on archive or only on export (nor am I very clear on the differences between the two), but I'll play with it.

Any pointers on where/how to start hacking on a flag to "queue users to download when encountered" (and only download those users) would be useful. I may take a whack at hacking it in myself! :-)
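One possible shape for such a flag, sketched here with purely hypothetical names (none of this is slackdump's actual code): record each user ID as messages are processed, then fetch only that deduplicated set afterwards instead of the whole workspace.

```go
package main

import (
	"fmt"
	"sort"
)

// UserQueue collects user IDs seen while processing messages, so that
// only those users need to be fetched afterwards. Everything here is a
// sketch; slackdump's real types and hooks will differ.
type UserQueue struct {
	seen map[string]bool
}

func NewUserQueue() *UserQueue { return &UserQueue{seen: make(map[string]bool)} }

// Note records a user ID; duplicates and empty IDs are ignored.
func (q *UserQueue) Note(userID string) {
	if userID != "" {
		q.seen[userID] = true
	}
}

// IDs returns the deduplicated set, sorted for stable output.
func (q *UserQueue) IDs() []string {
	ids := make([]string, 0, len(q.seen))
	for id := range q.seen {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	q := NewUserQueue()
	// While exporting messages, note each author (and e.g. reaction users).
	for _, author := range []string{"U01", "U02", "U01", "U03"} {
		q.Note(author)
	}
	// Afterwards, fetch only these users, e.g. in batches against the
	// users.info endpoint, instead of paging through users.list.
	fmt.Println(q.IDs()) // [U01 U02 U03]
}
```

The batching and API calls are the part that would need to hook into slackdump's own client and rate limiter.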

rusq (Owner) commented May 6, 2024

I will have to do a bit of diving here to explain.

V3 has a concept of "chunks": a centralised format that represents a "chunk" of API output, so each endpoint call maps to one chunk type, i.e. WorkspaceInfo, User, ChannelInfo, Messages, ThreadMessages, etc. The Chunk structure, if you look at it, is universal and contains all possible payload types that can be grabbed from the API endpoints. Depending on the API call, the respective chunk type is set, the relevant payload member variables are populated, and the structure is marshalled into the Writer. One could call it a "native slackdump format", because internally, in v3, everything goes through the chunk format.

"archive" creates a "recording" of the API output that can even be "replayed" later to mock the actual Slack API output. It can be converted to the "export" format later, if required, with slackdump convert.

"export" actually generates chunk files in a temporary directory, and then converts them to the "export" format "on the fly" into the destination directory. The same happens when you run "dump".

  • "archive" creates a bunch of gzipped JSONL files in the directory
  • "export" creates a bunch of JSON files in the Slack Export format that could (potentially, untested yet) be loaded into another slack instance.
  • "convert" can be used to convert from "archive" to "export" formats (but not the other way around)
  • "view" can be used to browse through the messages in the browser, it "understands" all three possible output formats - archive, export and dump.
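To illustrate the archive format's on-disk shape described above (gzipped JSONL, one JSON object per line), here is a minimal round-trip sketch; it uses only the standard library, and the helper names are invented for this example:

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
)

// writeJSONL gzips a set of JSON lines, the way an archive file stores
// one JSON object per line.
func writeJSONL(lines []string) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	for _, l := range lines {
		fmt.Fprintln(zw, l)
	}
	zw.Close()
	return buf.Bytes()
}

// readJSONL decompresses and splits the file back into its lines,
// which is all a converter needs to do before decoding each chunk.
func readJSONL(data []byte) ([]string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	var out []string
	sc := bufio.NewScanner(zr)
	for sc.Scan() {
		out = append(out, sc.Text())
	}
	return out, sc.Err()
}

func main() {
	data := writeJSONL([]string{`{"t":3,"channel_id":"C123"}`, `{"t":1}`})
	lines, _ := readJSONL(data)
	for _, l := range lines {
		fmt.Println(l)
	}
}
```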


cdeszaq commented May 7, 2024

To give a sense of the amount of time my Slack's user-data download takes (and the motivation for this overall feature request): that user portion of the archive (of 2 small channels) took more than 10 hours to retrieve more than 4 GB of data (uncompressed). The channel contents themselves (with a few files) took 9 seconds.
